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Preface 


The warning could not have been meant for the place 
where it could only be found after approach. 

—Joseph Conrad, Heart of Darkness 


This solution manual to Bayesian Essentials with R covers all the exer¬ 
cises contained in the book, with a large overlap with the solution manual 
of the previous edition, Bayesian Core, since many exercises are common to 
both editions. These solutions were written by the authors themselves and 
are hopefully correct, although there is a non-zero probability of typos and 
errors! Although we only noticed two difficulties in the text of the exercises 
(Exercises 7.11 and 7.18), there may also be remaining typos at that stage, 
so encourage the readers to contact us in case of suspicious wordings. 

The earlier warnings attached with the solution manual of Bayesian Core 
apply as well to this solution manual: some of our self-study readers may 
come to the conclusion that these solutions are too sketchy for them because 
the way we wrote those solutions assumes some minimal familiarity with the 
maths, the probability theory, and the statistics behind the arguments. There 
is unfortunately a limit to the time and to the efforts we can put in this 
solution manual and studying Bayesian Essentials with R does require some 
prerequisites in maths (such as matrix algebra and Riemann integrals), and 
in probability theory (such as the use of joint and conditional densities), as 
well as some bases of statistics (such as the notions of inference, sufficiency, 
and confidence sets) that we cannot usefully summarise here. Instead, we sug¬ 
gest Casella and Berger (2001) as a fairly detailed reference in case a reader 
is lost with the “basic” concepts or our sketchy math derivations. Indeed, 
we realised after publishing Bayesian Core that describing our book as “self- 
contained” was a dangerous label as readers were naturally inclined to relate 
this qualification to their current state of knowledge, a bias resulting in inap¬ 
propriate expectations. (For instance, some students unfortunately came to 
one of my short courses with no previous exposure to standard distributions 



VI 


like the t or the gamma distributions, and a deep reluctance to read Greek 
letters.) 

We obviously welcome comments and questions on possibly erroneous so¬ 
lutions, as well as suggestions for more elegant or more complete solutions: 
since this manual is distributed both freely and independently from the book, 
it can easily be updated and corrected [almost] in real time! Note however 
that the R codes given in the following solution pages are far from optimal or 
elegant because we prefer to use simple and understandable R codes, rather 
than condensed and efficient ones, both for time constraints and for pedagogi¬ 
cal purposes: the readers must be able to grasp the meaning of the R code with 
a minimum of effort since R programming is not supposed to be an obligatory 
entry to the book. In this respect, using R replaces the pseudo-code found in 
other books since it can be implemented as such but does not restrict under¬ 
standing. Therefore, if you find better [meaning, more efRcient/faster] codes 
than those provided along those pages, we would be glad to hear from you, 
but that does not mean that we will automatically substitute your R code for 
the current one, because readability is also an important factor. 

Sceaux & Montpellier, Prance, March 17, 2015 
Christian P. Robert & Jean-Michel Marin 
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Normal Models 


2.1 Show that, if 

^|cr^ ~ cr^/A^), (T^ ~ J^^(A^/2, a/2), 

then 

/i ~ a/A^A^) 

a t distribution with Ao- degrees of freedom, location parameter ^ and scale pa¬ 
rameter a/Xf^Xa- 

The marginal distribution of /i has for density-using t = cr^ as a shortcut 
notation- 




,Xa,^,a)(x j r ^ exp {-a/2T} dr 


T-^-/2-3/2exp 


2r 


dr 


oc {X^in-0^ + a} 




-(A„+r)/2 

-(A,,-i-r)/2 


which corresponds to the density of a 3X[Xtji^,a/X^Xa) distribution. 

2.2 Show that, if ~ .y'i^(a,/3), then E[(t^] = /3/(a — 1). Derive from the 
density of that the mode is located in /3/(a -(-1). 


Once again, use t = cr^ as a shortcut notation. Then 
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2 Normal Models 


poo oa 

E[a‘^]= r-— r"“"^exp{-^/r}dT 

Jo r{a} 

roo ooi 

= / T-“-—T-“-iexp{-/3/T}dT 

Jo r{a) 

/ 3 “ r{a-l) 

~ /3“-i r(a) 

= /3/(a - 1). 


2.3 Show that minimizing (in 0(^„)) the posterior expectation E[| \9 — 6\ P|^n] 
produces the posterior expectation as the solution in 0. 


Since 


E[L(0,0))|^„] =E[||0-0|n^] 

= E[( 0 - 0 y( 0 - 0 )i^„] 

= E[||0||2-20T0+||0||2|^„] 

= E[||0|n^„]-20TE[0|^„] + ||0||2 
= E[||0|n^„] - ||E[0|^„]|p + ||E[0|^„] - ef , 


minimising E[L(0,0))|^„] is equivalent to minimising | |E[0|^„]—0| p and hence 
the solution is 


9 = E[9\9n] ■ 


2.4 Show that the Fisher information matrix on 0 = (p., cr^) for the normal 
^(p,a^) distribution is given by 


/^(0) =Ee 


/ 2(a; —/r)/2(7^ \ 

\^2(x — /r)/2cr^ — x)^ja^ — \l2a^j 


(Ija^ 0 \ 
0 l/2a‘^) 


and deduce that Jeffreys' prior is 7r'^(0) oc l/cr^. 


The log-density of the normal ^{p, cr^) distribution is given by 


log ip(x;fx,a^) 


log(27rcr^) 


{x- nf 

rr2 


Hence, 
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E 

E 

E 


'd'^ log ip{x;p,a'^y 

= E 

dp^ 

'd^ log If {x;p,a^y 

= E 

dpda^ 

'd^ logf{x;p,a^y 

= E 

da^ 


(x- n) 


= 0 


1 _ jx- 

2tT'^ (T® 


2a^ CT® 


The corresponding Fisher information matrix 


I^{9) 


(1/a^ 0 \ 

0 l/2a^J 


1 


has the associated determinant det(/'^(6>)) = which does lead to 

7r'^(0) oc det(/'^(0))^'^^ oc 1/cr^ . 


2.5 Derive each line of Table 2.1 by an application of Bayes’ formula, 'k{0\x) oc 
'n{9)f{x\9), and the identification of the standard distributions. 

For the normal distribution 'P{9^a^), 

f{x\9) X tt{9\h, t) = if{a-^{x - 9})ip{T-'^{9 - ^}) 

oc exp {9^[a~^ + t“^] — 29[a~^x + 

oc exp {9^fpr'^a'^ — 29[t^x + cr^/ijp/pr^cr^} 

oc tp ([d - p{t^x + crV)] ra^ 

For the Poisson distribution V{9)^ 

f{x\9) X 7r(0|a,/3) oc ^ = 0-+“-ie-(/3+i)e 

which is proportional to the Q(a + x,(3 + \) density. 

For the Gamma distribution Q{v,9), 

f{x\9) X 7r(6»|a,^) oc 9''x''-^ oc 6»«+'^-ie-(/3+^)» 

which is proportional to the Q(a + v, (3 + x) density. 

For the Binomial distribution B{n,9), 

f{x\9) X Tr{9\a,P) oc 9^(1 - 9°^-^! - = 6»^+“-i(l - 0)"-^+/3-i 

which is proportional to the B{a + x, /3 + n — x) density. 

For the Negative Binomial distribution JVeg{m,9), 
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2 Normal Models 


f{x\9) X TT{9\a,l3) oc 6»'"(1 - 9^ r-i(l - 

which is proportional to the B{a + m, P + x) density. 

For the multinomial distribution M.{9i,... ,9k) 

k k k 

f{x\9) X ni9\a) (x H H = H 
2 = 1 2=1 2=1 

which is proportional to the I?(ai + Xi,... ,ak + Xk) density. 

For the normal 1/9) distribution, 

f{x\9) X Tr{9\a, /3) oc 0^/^ exp{—0(a:: — ^)^/2}0““^ exp{—/30} 

= 00 . 5 +a-i exp {-(/3 + 0 . 5 (a: - n)^)9} 

which is proportional to the G{ct + 0.5, /3 + 0.5(/r — x)"^) density. 

2.6 A Weibull distribution W{a,p,^) is defined as the power transform of a 
gamma ^{a,P) distribution: If cc ~ W{a,P,^), then x^ ~ ‘^{a,P). Show that, 
when 7 is known, W{a,P,'y) allows for a conjugate family, but that it does not 
an exponential family when 7 is unknown. 

For the first part, if 7 is known, observing x is equivalent to observing x'^, 
hence to be in a 1 ^(q!, /3) model for which a conjugate distribution is available. 
Since the likelihood function is 

£ix\a,P) oc = exp{alog(a:) - px + \og{P°‘/r{a))} , 

a conjugate distribution has a density proportional to 

^^{a,P\^,^J.,X) oc exp{a5-^/r +Alog(/3“/r(a))} , 

with A chosen so that the above function is integrable. 

A Weibull distribution has for density 

since the Jacobian of the change of variables y = x'^ is 'yx'^~^. If we express 
this density as an exponential transform, we get 

f(x\a, P, 7 ) = exp {[(/? + 1)7 - 1] log(a;) - aa;^'} , 

If 7 is unknown, the term x'^a in the exponential part makes it impossible 
to separate parameter from random variable within the exponential. In other 
words, it cannot be an exponential family. 
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2.7 Show that, when the prior on 6 = (/i, a^) is ct^/A^) x J^?^(Acr, a), the 
marginal prior on /r is a Student t distribution T{2Xa,^,a/Xf^Xa) (see Exercise 
2.1 for the definition of a Student t density). Give the corresponding marginal 
prior on a^. For an iid sample = {xi, ■ ■ ■ ,Xn) from tr^), derive the 

parameters of the posterior distribution of (/x, cr^). 

Since the joint prior distribution of (/x, cr^) is 


7 r(/x,CT2) oc (cr^) ^ exp^ {Xf,{fi-^ f+ 2a} 


(given that the Jacobian of the change of variable uj = a ^ is w ^), integrating 


ont cr^ leads to 






which is the proper density of a Student’s t distribution T(2Xa,^,a/X^Xa). 

By definition of the joint prior on (^, cr^), the marginal prior on cr^ is a 
inverse gamma J^'^(Xa,a) distribution. 

The joint posterior distribution of {^J.,cr‘^) is 


7r((/x, cr^)|^) oc (cr^) exp {- [X^[9){^l - £,{9)f + a(^)) /2a'^] , 


with 


Xa{^) — Act + 3/2 + n/2 , 
A^(^) = A^ + n, 

C(^) = (A^^ + mr)/A^(^), 


nX^ 

This is the product of a marginal inverse gamma 

y^{X,{9)-3/2,a{^)/2) 

distribution on cr^ by a conditional normal 

^{a^),ayx,m) 


a{^) = 2a+ ^f4^(x - ^ 
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2 Normal Models 


on /r. (Hence, we do get a conjugate prior.) Integrating out leads to 

poo 

7r{fi\^) cx / exp |— (A^(^)(/i — + a{^)) /2a^} da^ 

poo 

oc / exp {- (A^(^)(/x - e(^))" + a(^)) a;/2} dw 

which is the generic form of a Student’s t distribution. 


2.8 Show that the normalizing constant for a Student distribution 

r((r. + 0)/2)/r(r./2) 


Deduce that the density of the Student t distribution a'^) is 


r(y/ 2 ) 


VU‘‘ 


by 


The normalizing constant of a Student £^{v, distribution is defined 


r{{^ + Q)/2)/r{v/2) r{{v + 0)/2)/T(rz/2) 


ay'l^TT 

T((r. + 0)/2)/T(r//2) 


We have 


and thus 


aWi/TT 


( , / -c2 o ( ^ + y\'^ , {x-yf 

(y-x) +{y-y) =2U- —j + - - - 


J [iy - + iy- yf + S'^] ” d^l 


= 2 " 


= (2a^)- 


where v = 2n — I and 


ac + j/V {x — yY S'^ 


d/i 


x + y 


a^ = 


1 + pi — 


-\2 

x-y\ S 


-| -.^ + 1/2 


d^, 


(2n - 1). 





















2 Normal Models 


7 


Therefore, 


J [(p - xf + ifj-- yf + ”d/x 

OyJvTT 


= ( 2 ^^) 


2\—n 


r{{v + i)/2)/r{v/2) 

flTK 


2 "cr 2 n-lJ-((j, l)/2)/T(z^/2) 

(2n- 


n 2n-l 


(^) +f r((^ + l)/2)/TW2) 


Note that this expression is used later in the simplified derivation of 
without the term (2n—l)^”“^-yi^/2"'_r((i/ + l)/2)/_r(i^/2) because this term 
appears in both the numerator and the denominator. 


2.9 Show that, for location and scale models, the specific noninformative priors 
are special cases of Jeffreys’ generic prior, i.e., that 7r'^(0) = 1 and = 1/9, 

respectively. 


In the case of a location model, f{y\d) = p{y — 6), the Fisher information 
matrix of a location model is given by 


1(9) = Ee 


i91ogp(y — 9)'d log p{Y — 9) 


89 


89 


8p{y - 9)' 

T 

'8p{y-9)' 

89 


89 


'p{y - 9) dy 


8p{zy 

T 

'8p[zy 

8z 


dz 


'p{z) dz 


This matrix is indeed constant in 9. Therefore its determinant is also constant 
in 9 and Jeffreys’ prior on 9 can be chosen as 7r'^(0) = 1 [or any other constant 
provided the parameter space is not compact]. 

In the case of a scale model, if y ~ f{y/9)/9, a change of variable from 
y to z = log(y) [if y > 0] implies that y = log(0) is a location parameter for 
z. Therefore, the Jacobian transform of tt'’{ rj) = 1 is 7r'^(0) = 1/9. When y 
can take both negative and positive values, a transform of y into z = log(|y|) 
leads to the same result. 


2.10 Show that, when tt{9) is a probability density, (2.5) necessarily holds for 
all datasets S>n- 
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Given that 7r(0) is a (true) probability density and that the likelihood 
£{9\^) is also a (true) probability density in ^ that can be interpreted as a 
conditional density, the product 


TT{e)£{e\^) 


is a true joint probability density for (9,^). The above integral therefore 
defines the marginal density of which is always defined. 


2.11 Consider a dataset from the Cauchy distribution, ‘^(/r, 1). 

1. Show that the likelihood function is 


2 = 1 


1 

7^” nr=i(i+ 


m)^)' 


2. Examine whether or not there is a conjugate prior for this problem. (The 
answer is no.) 

3. Introducing a normal prior on fj,, say ^(0,10), show that the posterior dis¬ 
tribution is proportional to 




exp(—^^/20) 


4. Propose a numerical solution for solving 7r(^|^„) = k. {Hint: A simple trape¬ 
zoidal integration can be used: based on a discretization size A, computing 
TT{^\^n) on a regular grid of width A and summing up.) 


1. Since the Cauchy '^(/r, 1) distribution is associated with the density 


fix\9) 


1 

7r{l + {x- 0)2} 


the likelihood ^(/r|^„) is made of the product of the densities. 

2. Given that (.{^,\&n) is the inverse of a polynomial of order 2n, it can¬ 
not be associated with a sufficient statistic of fixed dimension against n. 
Therefore, there is no family of prior distributions parametrised by a fixed 
dimension vector that can operate as a conjugate family. The only formal 
family of conjugate priors is made of densities of the form 


7r(/r) oc 


nr=i(i + (^ 


A)^) 


where m and the m values are arbitrarily chosen. Since this family has 
an unbounded number of parameters, it is of limited modelling interest. 
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3. If ^ ~ A/’(0,10), 7r(^) oc exp{—^^/20}. Hence, 


7r(/r|^„) cx 


exp(—/r^/20) 


4. The question is ambiguous: as stated, there is no need to compute the 
normalising constant. However, the appealing version consists in finding 
an HPD region at a given confidence level a. 

First, we can define the un-normalised posterior as 


> Dn=rcauchy(100) 

> pitilde=function(the,Dn){ 
post=dnorm(the,sd=sqrt(10)) 

for (i in 1:length(Dn)) post=post*dcauchy(Dn[i]-the) 
return(post)} 

where Dn is the sample. To find the normalising constant, the easiest is 
to use integrate: 

> tointegre=function(x){ pitilde(the=x,Dn=Dn) } 

> Z=integrate(f=tointegre,low=-l,up=l)$val 
1.985114e-104 


From there, we need to compute coverages of HPD regions until we hit 
the proper coverage: 

trunpos=function(alpha=.95){ 

levels=max(pitilde(the=seq(-l,1,by=.01),Dn=Dn))*seq(.99,.01,by=-.01) 

cover=0 

indx=l 

while ((cover<alpha)I I(indx<length(indx))){ 
tointegre=function(x){ 

pitilde(the=x,Dn=Dn)*(pitilde(the=x,Dn=Dn)>levels[indx]) } 
cover=integrate(f=tointegre,low=-l,up=l)$val/Z 
indx=indx+l 

} 

return(levels[indx]) 

} 

For our simulated dataset, this results in 

> trunposO 

[1] 1.342565e-104 

> trunpos()/Z 
[1] 0.6763163 


2.12 Show that the limit of the posterior probability ¥^{fi < 0|a;) of (2.7) when 
T goes to oo is <P{—x/a). Show that, when ^ varies in M, the posterior probability 
can take any value between 0 and 1. 
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Since 


P^ti<0\x) = <P{-^{x)/u;) 


= <P 


= <P 


a i + T X a'^ +t‘ 




+ t'^x 


, \/ct^ + t'^V a'^r'^ J 

when ^ is fixed and r goes to oo, the ratio 

+ t'^x 

\/(T^ + T^Vcr^r^ 


goes to 


2 2 
r^x .. x 

iim _ : = lim = — . 

T—^oo -y/^2 _|_ j-2yyQ-2q-2 T —>^oo T (J G 

However, if ^ varies with r, the limit can be anything: simply take ^ = r^/i, 
then 


lim 


cr^r^/r + r'^x 


= lim 


a^fi + x a‘‘fi + x 


°° \/ T-foo -^/cr^ + cr 


2.13 Define a function BaHaJ of the ratio rat when z=mean(shift)/.75 in 
the function BaFa. Deduce from a plot of the function BaRaJ that the Bayes 
factor is always less than one when rat varies. {Note: It is possible to establish 
analytically that the Bayes factor is maximal and equal to 1 for r = 0.) 


Since 


BaFa=function(z,rat){ 

#rat denotes the ratio tau~2/sigma~2 
sqrtCl/(l+rat))*exp(z~2/(2*(l+l/rat)))} 

it is straightforward to define 


BaRaJ=function(rat){ 
BaFa(mean(shift)/.75,rat)} 


and to plot the corresponding curve (Figure 2.1 in this manual). 


2.14 In the application part of Example 2.1 to normaldata, plot the approxi¬ 
mated Bayes factor as a function of t. [Hint: Simulate a single normal c/F(0,1) 
sample and recycle it for all values of r.) 
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Fig. 2.1. Evolution of the Bayes factor as a function of r^/cr^. 


The Bayes factor is given by 




J + itJ- + i-yy + sly] "e 

J - x)^ + {y - y? + sly^ dfi 


where s'^y denotes the average 

- n 1 

i=l z=l 

As mentioned in Example 2.1, the denominator can be integrated in closed 
form: 


+ -2fi{x + y)+x'^ + y‘^ = 2{fj.-y2[x + y])‘^ + y2{x-y)'^ . 

Hence, if sly^ = 1 / 2 (x - V? + sly, 
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Fig. 2.2. Evolution of the Bayes factor approximation B2\{T>n) as a function of r, 
when comparing the fifth and the sixth sessions of Illingworth’s experiment. 


J [if^ - xf + - yf + sly] " d/i 

= J [2(m - V 2 [s + y]f + V 2 (s - yf + sly] 


= J [‘^{y-y‘^[x + y]f+ sly,] ” d^i 

= i / [2 (m-V 2[S + y])7s'y.+ 1]”” d/r 

^xyz 


1 


c2n 

^xyz 


2(2n-l) 

_{2n-l)sly^ 


(m 


^/2[x + y\f + 1 


—n 


d/i 


1 /"(»- ^l2)^(2n- l)7r 

72 ( 2 n- 1) 

1 r{n— Y2)i/7r 

y2r(n) ’ 


by identification of the missing constant in the t density (see Exercise 2.81. 

The integral in /i in the numerator can be found in the same way and it 
leads to the simplified form of Example 2,2: 




J [(2C + x-yf + 2 sly] e d^/W 2i 


[{x - y)"^ + 2 sly] 


-n+1/2 


The numerator can be aproximated by simulations from a normal c/E(0,r^) 
distribution. Therefore, simulating a normal sample of ^i’s (z = 

1,..., fV) produces a converging estimate of as 
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An R implementation is as follows: 

> illing=as.matrix(normaldata) 

> xsain=illing[illing[,l]==5,2] 

> xbar=meaii(xsam) 

[1] -0.041 

> ysain=illing[illing[,l]==6,2] 

> ybar=mean(ysam) 

[1] -0.025 

> Ssquar=9*(var(xsarni)+var(ysam))/lO 
[1] 0.101474 

> Nsim=10"4 

> montecarl=rnorm(Nsim) 

> BF=tau=seq(.1,10,le=100) 

> for (t in 1:100) 

BF[t]=mean(((2*tau[t]*montecarl+xbar-ybar)~2+2*Ssquar)"(-8.5))/ 
((xbar-ybar)"2+2*Ssquar)~(-8.5) 

> plot(tan,BF,type="l") 

2.15 In the setup of Example 2.1, show that, when ^ ~ ^(0, tr^), the Bayes 
factor can be expressed in closed form using the normalizing constant of the t 
distribution (see Exercise 2.81 


When ^ ~ ,yh(0,(T^), we have 



In the numerator, 


n[{p - - xf + i - yf + sly] + 



implies 
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y^2n{2n + 1) 




^(2,^-l)(!i-y)- 
2(2r. + l) 




=r(n)2"+in“ 


(2n- l)(a:-y)^ 
2(2n+ 1) 


■^xy 


y'n{2n + 1) 

Similarly, for the denominator 

- xf + {^i-yf =2 {fj,- y2[x + f\f + V2(S - yf ■ 


and 


J ^-n[i^-xf + {t,-yf+sl^]/2a^ dcr^ d/i 

_ J g-"[2(M-V2[S+l/])^ + V2(2-y)^+s^„]/2(T^ ^-2n-2 ^^2 
V^TT f [i/2(g_y)2+g2^]/2a= - 2 n -2 n 2 


\/2jn 

\fn 


r{n)2-n- [i/2(x-yf+ 4]-" 


Therefore, 


S2"l(2?n) = 


, ^ T(n)2"+in-" 

Y'n(2n+l) ^ ’ 


{ 2 n-\){x-yf 2 

2 ( 2 n+l) °xy 


4 T(n)2"n-" [V2(x - vY + ’ 


( 2 n-l)(a;-y)^ 2 

2 ( 2 ra+l) ^a:y 


V 2 n+ 1 [ 1 / 2 ( 0 ; - 2/)2 + 4 ] ■ 


2.16 Discuss what happens to the importance sampling approximation when 
the support of g is larger than the support of 7 . 


If the support of 7 , ©.y, is smaller than the support of g, the representation 


3 = 


Hx)g{x) 

l{x) 


7(0;) da; 


is not valid and the importance sampling approximation evaluates instead the 
integral 

f h{x)g{x) 

/ - lix) dx. 

Je, l{x) 
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2.17 Show that, when 7 is the normal cyV(0, vjiv — 2)) density and is the 
density of the t distribution with v degrees of freedom, the ratio 

fl{x) ^x'^{v- 2 )l 2 v 

does not have a finite integral. What does this imply about the variance of the 
importance weights? 

Deduce that the importance weights of Example 2.3 have infinite variance. 


The importance weight is 


e-xp{{e - {xi-ef] ^ 

2 = 1 


with 9 ^ While its expectation is finite—it would be equal to 1 

were we to use the right normalising constants—, the expectation of its square 
is not: 

/ n 

exp {(6' - /i)^/2} JJ[1 + (x, - = +cx), 

2=1 

due to the dominance of the exponential term over the polynomial term. 


2.18 If fi, denotes the density of the Student t distribution (see 

Exercise 2.81, consider the integral 


a = 


1 — X 


fv{x)dx. 


1. Show that 3 is finite but that 

[ /i.(a:)da: = 00. 

2. Discuss the respective merits of the following importance functions 7 

- the density of the Student 0,1) distribution, 

- the density of the Cauchy ^(0,1) distribution, 

- the density of the normal 7 K( 0 , — 2 )) distribution. 

In particular, show via an R simulation experiment that these different choices 
all lead to unreliable estimates of 3 and deduce that the three corresponding 
estimators have infinite variance. 

3. Discuss the alternative choice of a gamma distribution folded at 1 , that is, 
the distribution of x symmetric around 1 and such that 
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|a; - 1| ~ Ga{a, 1). 

Show that 

oc |1 - exp|l - x\ 

l{x) 

is integrable around x = 1 when a < 1 but not at infinity. Run a simulation 
experiment to evaluate the performances of this new proposal. 


1. The integral 3 is finite when i' > 1/2 since the function 



Mx) 


is equivalent to at a; = ± 00 . Since 1 / + 1/2 > 1, the 

function is integrable. (The condition v > 1/2 is missing in the text of 
the exercise.) Similarly, at a: « 1, the function is equivalent to |1 — 
which is integrable. 

The function 


is not integrable at x = 1 since it is equivalent to 1/|1 — x|. 

2. Using as importance function 7 

- the density of the Student 0,1) distribution produces an impor¬ 

tance weight of 1 and an infinite variance estimator since the integrand 
is not square integrable; 

- the density of the Cauchy ^(0,1) distribution produces a well-behaved 
importance weight since the Cauchy has heavier tails when la > 1 / 2 , 
however, the integrability problem at x = 1 remains, hence an impor¬ 
tance sampling estimate with infinite variance; 

- the density of the normal .yC( 0 , v/[y—2)') distribution faces difficulties 
both with integrability of the squared integrand at x = 1 and with the 
infinite variance of the importance weight due to thinner tails. 

When evaluating the performances of the three solutions in R, one can 

use the following: 

grand=function(x,nu=3){ 
sqrt(abs(x)/abs(1-x))} 

N=10'3 

sajiipone=rt (N, df=3) 
sajiiptwo=rcauchy (N) 
sajiiptre=rnorm(N) 

weitwo=dt(samptwo,df=3)/dcauchy(samiptwo) 
weitre=dt(samptre,df=3)/dnorm(samptre) 
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Fig. 2.3. Evolution of three importance sampling evaluations of the integral J using 
a normal sample (gold), a ts sample (blue), and a Cauchy sample (sienna). 


plot(cumsum(grEind(scmiptwo)*weitwo)/(l :N) ,type="l" , 

xlab="simulations",ylab="cumulated average",lwd=2,col="sienna" 
lines(cumsum(grand(samptre)*weitre)/(1:N),col="steelblue",lwd=2) 
lines(cumsum(grand(sampone))/(1:N),col="gold2",lwd=2) 

Running the above code several times exhibits variability in the outcome, 
with sometimes agreement between the estimators and sometimes huge 
jumps in some of the series, as exemplified by Figure in this manual. 

3. If we consider instead the folded Gamma solution, its density is 




|1 — a; 


g-|l-=r| 


Therefore, taking h{x) = |a;|/|l —a;| (missing from the text of the exercise). 


h{x) 


P{x) 

l[x) 


oc \/R/^(x) |1 - a;|^ “ ^exp|l-x| 


which is integrable around x = 1 when a < 1 but not at x = ±oo. 
Running the R code 


alpha=.5 

y=rgamma(N,sh=alpha) 

x=sample(c(-1,1),N,rep=TRUE)*y+l 

weiqar=2*dt(x,df=3)/dgamma(y,sh=alpha) 


does not show a considerable improvement in the evaluation of the integral 
(Figure 2.4 in this manual). (It may be noted that in this particular run, 
the folded Gamma solution does provide the estimation the closest to the 
true value.) 
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Fig. 2.4. Evolution of three importance sampling evaluations of the integral 3 using 
a normal sample (gold), a tz sample (blue), a Cauchy sample (sienna), and a folded 
Gamma Q{.5, 1) (tomato). 


2.19 Evaluate the harmonic mean approximation 


mi 




when applied to the a^) model, normaldata, and an 1) prior on a^. 

Given a normal ./K(0 ,(t^) sample and a 1^(1,1) prior on r = the 
posterior on r is simply 

7r(T|I?„) oc exp exp{—r} = t'^'^ exp 

which means that the posterior distribution on r is a 

^ ( V2 + 1, V2^a;2 + 1 j 


1 + 1 / 2 ^; 


i=l 


i=l 


distribution. 

Evaluting the harmonic mean approximation thus implies producing a 
sample from the posterior 

N=10'4 

sinitau=rgamma(N,sh=33,rat=l+.5*sum(normaldata$x2)) 
and averaging the inverse likelihoods 








2 Normal Models 


19 


> kood=function(tau){ (2*pi/tau)"(-32)*exp(-0.5*sum(normaldata$x2~2)*tau) } 

> l/mean(l/kood(simtau)) 

[1] 1.149142e-21 

If we repeat this experiment many times, the estimates remain within this 
order of magnitude. However, the true value of the marginal likelihood is 


(27r)-"/= 



1 + 1 / 2 ^ 

Z=1 



dr = (27r) r'(”/ 2 )) 


i=l 


-l-"/2 


equal to 

> (2*pi)"(-32)*gaimna(32)/(1+0.5*sum(normaldata$x2~2))"33 
[1] 0.0001717292 

There is therefore no connection between the estimate and the true value of 
the marginal likelihood, confirming our warning that it should not be used. 
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3.1 Show that the matrix Z is of full rank if and only if the matrix Z^Z is in¬ 
vertible (where Z^ denotes the transpose of the matrix Z, which can be produced 
in R using the t(Z) command). Apply to Z = [1„ X] and deduce that this 
cannot happen when p -|- 1 > n. 

The matrix X is a (n, fc -|- 1) matrix. It is of full rank if the fc -|- 1 columns 
of X induce a subspace of K" of dimension (A: - 1 - 1), or, in other words, if 
those columns are linearly independent: there exists no solution to X 7 = 0 „ 
other than 7 = 0„, where O^+i denotes the {k + l)-dimensional vector made 
of O’s. If X^X is invertible, then Xj = 0„ implies X'^X^ = X^0„ = Ofe+i 
and thus 7 = {X~^X)~^Qk+i = Ofc-i-i: therefore X is of full rank. If X^X is 
not invertible, there exist vectors /3 and 77 ^/? such that X^XjS = X^X 7 , 
i.e. X'^X{I3 — 7 ) = Ofc+i. This implies that ||X(^ — 7 )!^ = 0 and hence 
X(/3 — 7 ) = 0„ for /? — 7 7 ^ Ofc+i, thus X is not of full rank. 

Obviously, the matrix (A: -I- I, fc -|- 1) matrix X^X cannot be invertible if 
A: -h 1 > n since the columns of X are then necessarily linearly dependent. 

3.2 Show that solving the minimization program 

mm (y - X/3)'^(y - X/3) 

requires solving the system of equations (X^X)/3 = X^y. Check that this can 
be done via the R command solve(t(X)7o*yo(X) ,t (X)7„*7,y). 

If we decompose (y — X/3)^(y — Xfi) as 

yTy - 2 yTx^ + 0^X^Xp 
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and differentiate this expression in (3, we obtain the equation 
-2yTx + 20^X^X = Ok+i , 


i.e. 

iX^X)(3 = X^y 


by transposing the above. 

As can be checked via help (solve), solve (A, b) is the R function that 
solves the linear equation system Ax = b. Defining X and y from caterpillar, 
we get 

> solve (t (X) 7o*"/X, t (X) 7o*7,y) 


[, 1 ] 


repd, 33) 

10.998412367 

VI 

-0.004430805 

V2 

-0.053830053 

V3 

0.067939357 

V4 

-1.293636435 

V5 

0.231636755 

V6 

-0.356799738 

V7 

-0.237469094 

V8 

0.181060170 

V9 

-1.285316143 

VIO 

-0.433105521 

which [obviously] 
function lm(): 

> Im(y'X-l) 

gives the same 

Call: 

lm(formula = 

\ 

><! 

1 

Coefficients 


XrepCl, 33) 

XVI 

10.998412 ■ 

-0.004431 -0 

XV6 

XV7 


-0.356800 -0.237469 


XV2 

053830 

XV8 

0.181060 


XV3 

0.067939 

XV9 

-1.285316 


the linear regression 


XV4 XV5 

-1.29363 0.23163 

XVIO 

-0.43310 


Note the use of the -1 in the formula y~X-l that eliminates the intercept 
already contained in X. 


3.3 Show that the variance of the maximum likelihood estimator of /3 in the 
regression model is given by V(/3 |(t^) = cr^(X^X)"^. 
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Since j3 = {X~^X) ^X~^y is a linear transform of y ^ ^{X f3, a'^ In), we 
have 

/3 ~ ^ {{X^X)-^X^Xl3,a‘^{X^X)-^X^X{X^X)-^) , 
P^^{P,a^iX^X)-^) . 


3.4 For the model 

y|/3,fT2~^„(X/3,a2l„) 

a conjugate prior distribution is as follows: the conditional distribution of (3 is 
given by 

P\a^ ^ , 

where M is a {p,p) positive definite symmetric matrix, and the marginal prior on 
(T^ is an inverse Gamma distribution 

cr^ ~ 6), a,b>0. 

Taking advantage of the matrix identities 

(M + = M~i - M-i (M"i + (X'^X)-^)”^ M"i 

= (X'^X)-i - (X'^X)-^ (M-^ + (X'^X)-^)”^ (X'^X)”^ 


and 


X'rX(M + X'^Xj-^M = (M-i(M + X'rX)(X'rX)-^) 
= (M-i + (XTx)-i)”^ , 


establish that 

P\y, ((M + X^X)-^{{X^X)P + M/3}, + X^Xj-i) (3.8) 


where /3 = (XTX)-iXTy and 


(3.9) 

where = (y — /3X)^(y — /3X) are the correct posterior distributions. Give a 
(1 — a) HPD region on (3. 


, u s' (/3 -/3)T (M“1 + (X^X) 


Starting from the prior distribution 


the posterior distribution is 
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7r(/3, X) OC cr 1 2a 2 n 

+(/3-/3)T(xTx)(/3-/3) + s 2 + 26} 

= ^-k-n-2a-3 ^ - 20^ [M^ + 

2 cr^ I 

+0^MP + j3^{X^X)P + + 26 | 

= exp ^ {(/3 - E[/3|y, X])T(M + X){P - E[/3|y, X]) 

+/ 3 TM /3 + /3T(xTx)/ 3 - E[/3|y, X^{M + xTx)E[/3|2/, X] + + 2&} 

with 

E[P\y,X] = {M + X^X)-\MP + X^X0 . 

Therefore, (3.3) is the conditional posterior distribution of (3 given . Inte¬ 
grating out f3 leads to 

7r(a2|/3, s^, X) cx exp ^ + 0^{X^X)p 

-E[/3|y, X]T(M + XTx)E[/3|y, X] + + 2b} 

= ct-"- 2“-2 exp + p'^iX~^X)P + s^ + 2b 

2(7^ I 

-(M/3 -b X'rX/3)'^(M -b X~^X)-0Ml3 + X^^X^)! 

Using the first matrix identity, we get that 

iMi3+X~^Xl30 (M -b X'^X)"^ (M/3 -b X'^X/^) 

= p'^Mp - p'^ (M-i -b (X'^X)-!)”^ p 

+ P^{X'^X)P - P^ (M-1 -b (X'^X)-!)”^ P 

+ 2P^{X'^X) (M -b X'^X)"^ Aip 

= ~P'^ Mp + P'^ {X^ X)p 

-iP- /3)T (m-1 + (X^X)-!)-' (/3 - /3) 

by virtue of the second identity. Therefore, 

7 r((j^|/3,s^X) cx cr-”-2“-2 exp^ |(^-/3)'^ (M"^ 

+ (XTX)-1)”^ (/3 - /3) + + 25 } 

which is the distribution (3.4). 

Since 

/3|y, X ~ P^k+i [n + 2a, /i, Ej , 
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this means that 


n{l3\y,X) (X ^ 



n + 2a 


(n+2a+fc+l) 


and therefore that an HPD region is of the form 

, (/3 - - A) < fcaj , 


where is determined by the coverage probability a. 

Now, (/3 — — fi) has the same distribution as \\z\\^ when 

z ~ ^k+i{n + 2a, 0, Ik+i)- This distribution is Fisher’s + 1, n + 2a) dis¬ 
tribution, which means that the bound ka is determined by the quantiles of 
this distribution. 


3.5 The regression model of Exercise 3.4 can also be used in a predictive sense: 
for a given {m,p+ 1) explanatory matrix X, i.e., when predicting m unobserved 
variates the corresponding outcome y can be inferred through the predictive 
distribution 7r(y|(T^,y). Show that 7r(y|(T^,y) is a Gaussian density with mean 

E^[y\a^,y] = X{M + X^X)-^{X^X^+ M^) 

and covariance matrix 

V"(y|a2,y) = + X(M + X'^Xy^X^). 


Deduce that 

y|y -Xn(n + 2a, X(M + X'^Xy^X'^Xp + M/3), 

2b + s^ + y- ^y + (X^X)"!)"^ (A - /3) 

n -h 2a 

X {l^ + X(M -b XTX)-1XT J) . 


Once again, integrating the normal distribution over the inverse gamma 
random variable cr^ produces a Student’s distribution. Since 

V ~ IS [I ^ - isfx-xw - ffl) 

under Zellner’s G-prior, the predictive distribution is a 


y|y,X,X - 3^k+i n,X 


/3 + C/3 cis^ + y-pyx^xy-y/{c + i)) 


c-b 1 ’ 


n(c -b 1) 


im + -—xix^xy^x^ 

c+l 


X 
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distribution. 

3.6 Show that the marginal distribution of y associated with (3.8) and (3.9) is 
given by 

The joint posterior is given by 

~ J^^((n - fc - l)/2, s^/2). 


{ 2 a, X^, ^ (I„ + XM”1xT) 


Therefore, 


li\y,X^%+, {n-k-l,p,-^^{X^X)-^'^ 
by the same argument as in the previous exercises. 

3.7 Show that the matrix (I„ + 5 X(X^X)“^X^) has 1 and 5 + 1 as only 
eigenvalues. {Hint: Show that the eigenvectors associated with 5 + 1 are of the 
form X/3 and that the eigenvectors associated with 1 are those orthogonal to 
X). Deduce that the determinant of the matrix (I„ + gX(X^X)“^X^) is indeed 

(5 + l)P+i. 


Given the hint, this is somewhat obvious: 

{In + cX{X~^X)-^X~^)XP = Xj3 + cX{X~^X)-^X~^XP 

= {c+l)X/3 

{In + cX{X~^X)-^X~^)z = z + cX{X~^X)-^X~^z 

= Z 

for all /3’s in and all z’s orthogonal to X. Since the addition of those two 
subspaces generates a vector space of dimension n, this defines the whole set 
of eigenvectors for both eigenvalues. And since the vector subspace generated 
by X is of dimension (A: + 1), this means that the determinant of 

{In+cX{X^X)-^X^) 


is (c+ 1)'=+! X 
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3.8 Under the Jeffreys prior, give the predictive distribution of y, m dimensional 
vector corresponding to the {m,p) matrix of explanatory variables X. 


This predictive can be derived from Exercise 3.5 Indeed, Jeffreys’ prior 
is nothing but a special case of conjugate prior with a = b = 0. Therefore, 
Exercise |3.5| implies that, in this limiting case. 


y|y, (n, X{M + X'^X)-\x'^Xp + M^), 

g2 + (^ - py ^ CP - P) 

n 

X {/^ + X{M + J) . 


3.9 If {xi,X 2 ) is distributed from the uniform distribution on 

{{xi,X2)-, {xi - 1)^ + {X2 - 1)^ < l}u{(a;i,a; 2 ); (a;i + 1)^ + (x2 + 1)^ < l} , 

show/ that the Gibbs sampler does not produce an irreducible chain. For this dis¬ 
tribution, find an alternative Gibbs sampler that w/orks. {Hint: Consider a rotation 
of the coordinate axes.) 


The support of this uniform distribution is made of two disks with re¬ 
spective centers (-1,-1) and (1,1), and with radius 1. This support is not 
connected (see Figure 3.1 in this manual) and con ditioni ng on x i < 0 m eans 
that the conditional distribution of X 2 is {—\—^J\ — — x\, thus 

cannot produce a value in [0,1]. Similarly, when simulating the next value of 
Xi, it necessarily remains negative. The Gibbs sampler thus produces two 
types of chains, depending on whether or not it is started from the negative 
disk. If we now consider the Gibbs sampler for the new parameterisation 


yi= X1+X2, y2=X2-Xi, 

conditioning on yi produces a uniform distribution on the union of a negative 
and of a positive interval. Therefore, one iteration of the Gibbs sampler is 
sufficient to jump [with positive probability] from one disk to the other one. 

3.10 If a joint density g{yi,y 2 ) corresponds to the conditional distributions 
9 i{yi\y 2 ) and g 2 {y 2 \yi), show that it is given by 

/ X ^ 92{y2\yi) 

^ / g 2 iv\yi)/gi{yi\v) dv' 
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Fig. 3.1. Support of the uniform distribution. 


If the joint density 5(1/1, j/2) exists, then 

5(1/1,52) = 5^(51)52(52151) 
= 5^(52)51(51152) 


where 5^ and 5^ denote the densities of the marginal distributions of 51 and 
52, respectively. Thus, 


5 ^( 51 ) = 


a 


5i(5i|52) 

52(52151) 

5i(5i|52) 

52(52151) 


5 ^( 52 ) 


as a function of 51 [5^(52) is irrelevant]. Since 5^ is a density. 


5 ^( 51 ) 


51(51 1 52) 
52 ( 52151 ) 


5i(^dii 

52(52111) 


and 

5(51,52) = 5 l( 5 l| 52 ) / 

/ J 52(52|ll) 

Since 51 and 52 play symmetric roles in this derivation, the symmetric version 
also holds. 


3.11 Considering the model 

r]\6 ~ Sin(n, 9), 5 ~ i3e(a, 6), 
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derive the joint distribution of {ri,6) and the corresponding full conditional distri¬ 
butions. Implement a Gibbs sampler associated with those full conditionals and 
compare the outcome of the Gibbs sampler on 6 with the true marginal distribu¬ 
tion of 9. 

The joint density of (ry, 9) is 

9) oc 9'^{l - 6')"-’' r(l - 9f . 

The full conditionals are therefore 


rj\9 ~ ;Bin(n, 9) 9\'q ~ Be{a + r],h + n — rf). 


This means running a Gibbs sampler is straightforward: 


# pseudo-data 
n=18 
a=b=2.5 
N=10'5 

#storage matrix 

#col.l for eta, col.2 for theta 
gibb=matrix(NA,N,2) 
gibb [1,1]=sample(0:n,1) 

gibb[1,2]=rbeta(l,a+gibb[1,1],b+n-gibb[1,1]) 
for (t in 2:N){ 

gibb[t,1]=rbinom(l,n,gibb[t-1,2]) 

gibb[t,2]=rbeta(l,a+gibb[t,1],b+n-gibb[t,1])} 

The output of the above algorithm can be compared with the true marginal 
distribution, namely the Be{a, b) distribution 


hist(gibb[,2],prob=TRUE,col="wheat") 
curve(dbetaCx,a,b),add=TRUE,lwd=2) 


which shows indeed a very good fit (Figure 3.2 in this manual). 


3.12 Take the posterior distribution on {9,a'^) associated with the joint model 


Xi\9,a^ i = 


9r^^{9o,T^), a^--W{a,b). 

Show that the full conditional distributions are given by 


lx, ~ ,/F 


cr^ 

r'o H-- n X, 




+ nr 


and 
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0.0 0.2 0.4 0.6 0.8 1.0 


Fig. 3.2. Fit of the Gibbs output to the Beta B(5/2, 5 / 2 ) distribution. 


cr^lx, 6» ~ M + a, ^ Y^{x^ - Bf + bj , 

where x is the empirical average of the observations. Implement the Gibbs sampler 
associated with these conditionals. 

From the full posterior density 


7r(6»,CT^|x) 


n 


oc J|exp{-(a:i - 6*)^/2 (t^} exp{-(6» - 6»o)^/2r^} (ct^) “ \ 

= ((7^)“"/^“““^ exp{—n(a; — B)^ j2a^ — s^/2(t^ — (B — Bo)^j2T^ 


exp{—6/cr^} 
-bla^} 


we derive easily that 

7 r(0|x, cr) oc exp{— n(a; — B)’^ j2a^ — (B — 0o)^/2r^} , 


which leads to 


0 |x, a 


2 




+ nr^ 


^*0 + 


+ UT 


2 > 


Cp- + flT^ ) 


Similarly, 


7r(CT^|x, 0) oc (cr^) ^ exp{—— 0)^/2cr^ — 5 /(t^} , 

i=l 


hence 
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a2|x,6»~/^ \^/2 + a,y2^{xi- ef + b 

Running an R code based on those two conditionals is straightforward: 

# pseudo-data 
n=1492 
x=rnorm(n) 
meEinx=mecUi (x) 
varx=var(x)*(n-1) 
a=b=2.5 

tau=5 

meantop=n*tau*meanx 
apost=a+(n/2) 

# Gibbs parameters 
N=10'4 

gibb=matrix(NA,N,2) 

gibb[1,1]=rnorm(l,mean(x),6) 

gibb[1,2]=l/rgamma(l,sh=apost,rate=b+0.5*sum((x-gibb[1,1])"2)) 
for (t in 2:N){ 

gibb [t, 1] =rnorm(l ,mecLn=mecintop/ (gibb [t-1,2] +n*tau) , 
sd=sqrt(gibb[t-1,2]*tau/(gibb[t-1,2]+n*tau))) 
gibb[t,2]=l/rgamma(l,sh=apost,rate=b+0.5*sum((x-gibb [t,1])“2)) 

> 

# remove warmup 
gibb=gibb[(N/10):N,] 
par(mfrow=c(1,2)) 

plot(gibb,typ="l",col="gray",ylab=expression(sigma~2)} 
grid.the=seq(-.15,.15,le=lll) 
grid.sig=seq(.8,1.2,le=123) 
like=function(the,sig){ 

-.5*n*(meanx-the)"2/sig-.5*varx/sig-.5*n*log(sig)- 

dnorm(the,sd=sqrt(tau),log=TRUE)-dgcmima(1/sig,sh=a,rat=b,log=TRUE)} 
post=matrix(NA,111,123) 
for (i in 1:111) 

post [i,]=like(grid.the[i],grid.sig) 
image(grid.the,grid.sig,post) 
points(gibb,cex=.4,col="sienna") 
contour(grid.the,grid.sig,post,add=TRUE) 

Figure |3.3| in this manualshows how the Gibbs sample fits the target, after 
eliminating 10^ iterations as warmup. 




Fig. 3.3. Gibbs output for the normal posterior with (left) Gibbs path and (right) 
superposition with the log-posterior. 
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4.1 Show that, for the logistic regression model, the statistic X)r=i Vi 
ficient when conditioning on the x*’s (1 < f < n), and give the corresponding 
family of conjugate priors. 


The likelihood associated with a sample ((j/i, xi),..., (j/„,x„)) from a logistic 
model writes as 


^(^|y,x) 


A/ exp(x-T/3) y / 1 

4:^ \ 1 + exp(x^’'’/3 )) \ 1 + exp(x^’'’/3 )) 


= exp 



n[l + exp(x*T/3)] . 

i=l 


Hence, if we consider the x*’s as given, the part of the density that only 
depends on the yiS is 

exp I ^ 2 /i x*"^/? 

I i=i 

and factorises through the statistic Vi x*- 

This implies that the prior distribution with density 

7 r(/3|^o,A) (xexp{^([/3} / [l + exp(x*V)]^ 

' i=l 


is conjugate, since the corresponding posterior is 7r(/3|^o + ?/i x*, A + 1). 

4.2 Show that the logarithmic link is the canonical link function in the case of 
the Poisson regression model. 
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The likelihood of the Poisson regression model is 

^(/3|y,^) = n exp{?/,x*'^/3-exp(x*V)} 

n ^ 

i=i y^- 

so log(/j,i) = x®^/3 and the logarithmic link is indeed the canonical link func¬ 
tion. 


4.3 Suppose yi,... ,yk are independent Poisson ^{yi) random variables. Show 
that, conditional on n = Vi’ 

y = (2/1, ■ • ■, 2/fc) ~ ^k{n-, ai,..., Ofe), 
and determine the ai’s. 


The joint distribution of y is 

f(y\y-i,...,tJ-k) = n exp|-^/r,| , 

while n = t^i) [which can be established using the mo¬ 

ment generating function of the V{y.) distribution]. Therefore, the conditional 
distribution of y given n is 


f{y\y,i,...,y,k,n) 


nL (S)gxp{-ELit..} 



P. TT / 

nL.!/.! iUsLif.. 



which is the pdf of the A4fc(n; ai,..., au) distribution, with 


This conditional representation is a standard property used in the sta¬ 
tistical analysis of contingency tables (Section 4.5): when the margins are 
random, the cells are Poisson while, when the margins are fixed, the cells are 
multinomial. 
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4.4 For TT the density of an inverse normal distribution with parameters 9i = 3/2 
and 02 = 2, 

7 r(a;) oc exp(— 3 / 2 a; — 2 /a;)Ia;>o, 

write down and implement an independence MH sampler with a Gamma proposal 
with parameters (a,/3) = (4/3,1) and {a,/3) = (0.5a/4/3,0.5). 

A possible R code for running an independence Metropolis-Hastings sam¬ 
pler in this setting is as follows: 

# target density 

target=function(x,thel=l.5,the2=2){ 
x"(-thel)*exp(-thel*x-the2/x) 

> 

al=4/3 

bet=l 

# initial value 
mcmc=rep(l,1000) 

for (t in 2:1000){ 

y = rgcmfflia(l,shape=al,rate=bet) 

if (runif(l)<target(y)*dgcmima(mcmc[t-1],shape=al,rate=bet)/ 
(target (memo [t-1] ) *dgaiimia(y, shape=al ,rate=bet) ) ) 
mcme[t]=y 
else 

memo[t]=mcmc [t-1] 

} 

# plots 

par(mfrow=c(2,1),mar=c(4,2,2,!)) 

res=hist(mcme,freq=F,nclass=55,prob=T,col="grey56", 
ylab="",main="") 

lines(seq(0.01,4,length=500),valpi*max(res$int)/max(valpi), 
lwd=2,col="sienna2") 

plot(mcmc,type="l",col="steelblue2",lwd=2) 


The output of this code is illustrated on Figure [TT] in this manual and shows 
a reasonable fit of the target by the histogram and a proper mixing behaviour. 
Out of the 1000 iterations in this example, 600 corresponded to an acceptance 
of the Gamma random variable. (Note that to plot the density on the same 
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scale as the histogram, we resorted to a trick by identifying the maxima of 
the histogram and of the density.) 



mcmc 



I ndex 

Fig. 4.1. Output of an MCMC simulation of the inverse normal distribution. 


4.5 Consider xi, X 2 , and X 3 iid ‘^{9, 1 ), and 7 r( 0 ) oc exp(— 0^/100). Show that 
the posterior distribution of 0 , 7 r( 0 |xi, X 2 , X 3 ), is proportional to 

exp(-0VlOO)[(l + {9- xi)2)(l + {9- X2)')(l + {9- (4.1) 

and that it is trimodal when xi = 0, X 2 = 5, and X 3 = 9. Using a random walk 
based on the Cauchy distribution ^(0, ct^), estimate the posterior mean of 9 using 
different values of In each case, monitor the convergence. 


The function (4.1) appears as the product of the [Normal] prior by the 
three [Cauchy] densities f{xi\9). The trimodality of the posterior can be 
checked on a graph when plotting the function (4.1). 

A random walk Metropolis-Hastings algorithm can be coded as follows 


x=c(0,5,9) 

# target 

targ=function(y) { 

dnormCy,sd=sqrt(50))*dt(y-x[1],df=1)* 
dt(y-x[2],df=l)*dt(y-x[3],df=l) 

} 


# Checking trimodality 
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plot(seq(-2,15,length=250), 

targ(seq(-2,15,length=250)),type="l") 

sigma=c(.001,.05,1)*9 # different scales 
N=100000 # number of mcmc iterations 


mcmc=matrix(mean(x),ncol=3,nrow=N) 
for (t in 2:N){ 


mcmc[t,]=mcmc[t-1,] 
y=mcmc[t,]+sigma*rt(3,1) # rnorm(3) 
valid=(runif(3)<targ(y)/targCmcmc[t-1,])) 
mcmc[t,valid]=y[valid] 

> 


The comparison of the three cumulated averages is given in Figure 4.2 in this 


manual and shows that, for the Cauchy noise, both large scales are acceptable 
while the smallest scale slows down the convergence properties of the chain. 
For the normal noise, these features are exacerbated in the sense that the 
smallest scale does not produce convergence for the number of iterations under 
study [the blue curve leaves the window of observation], the medium scale 
induces some variability and it is only the largest scale that gives an acceptable 
approximation to the mean of the distribution (4.1). 


4.6 Estimate the mean of a i^a(4.3,6.2) random variable using 

1. direct sampling from the distribution via the R command 

> x=rgamma(n,4.3,scale=6.2) 

2. Metropolis-Hastings with a ifa(4,7) proposal distribution; 

3. Metropolis-Hastings with a ^a(5,6) proposal distribution. 

In each case, monitor the convergence of the cumulated average. 


Both independence Metropolis-Hastings samplers can be implemented via 
an R code like 

al=4.3 
bet=6.2 

mcmc=rep(l,1000) 
for (t in 2:1000){ 

mcmc[,t]=mcmc [,t-l] 
y = rgcmima(500,4,rate=7) 

if (runif(l)< dgcmima(y,al,rate=bet)*dgamma(mcmc[t-1],4,rate=7)/ 
(dgammaCmcmc[t-1],al,rate=bet)*dgamma(y,4,rate=7))){ 
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iterations 



iterations 


Fig. 4.2. Comparison of the three scale factors a = .009 (blue), a = .45 (gold) and 
(7 = 9 (brown), when using a Cauchy noise (left) and a normal noise (right). 


mcmc[t]=y 

} 

} 

aver=cumsum(mcmc)/1:1000 


When comparing those samplers, their variability can only be evaluated 
through repeated calls to the above code, in order to produce a range of out¬ 
puts for the three methods. For instance, one can define a matrix of cumulated 
averages aver=matrix (0,250,1000) and take the range of the cumulated av¬ 
erages over the 250 repetitions as in rcuij=apply(aver, 1 .range), leading to 
something similar to Figure 4.3 in this manual. The complete code for one of 
the ranges is 














al=4.3 
bet=6.2 
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mcmc=matrix(1,ncol=1000,nrow=500) 
for (t in 2:1000){ 
mcmc[,t]=mcmc 
y = rgcmfflia(500,4,rate=7) 
valid=(runif(500)<dgcmima(y,al,rate=bet)* 

dgaimna(mcmc[i,t-l],4,rate=7)/(dgamma(mcmc[,t-l],al,rate=bet)* 
dgammaCy,4,rate=7))) 
mcmc[valid,t]=y[valid] 

} 

aver2=apply(mcmc,1,cumsum) 
aver2=t(aver2/(1:1000)) 
ranj 2=apply(aver2,2,range) 

plot(ranj2[1,],type="l",ylim=range(ranj2),ylab="") 
polygon(c (1:1000,1000 :1) , c(ranj2 [2,] ,rev(rELnj2 [1,] ) ) ) 

which removes the Monte Carlo loop over the 500 replications by running 
the simulations in parallel. We can notice on Figure [4[3] in this manual that, 
while the output from the third sampler is quite similar with the output from 
the iid sampler [since we use the same scale on the y axis], the Metropolis- 
Hastings algorithm based on the (^a(4, 7) proposal is rather biased, which 
may indicate a difficulty in converging to the stationary distribution. This is 
somehow an expected problem, in the sense that the ratio target-over-proposal 
is proportional to exp(0.8a;), which is explosive at both x = 0 and x = oo. 



O 200 400 GOO 800 O 200 400 GOO 800 O 200 400 GOO 800 


Fig. 4.3. Range of three samplers for the approximation of the 5fa(4.3, 6.2) mean: 
(left) iid; (center) 5fa(4, 7) proposal; (right) ^o(5, 6) proposal. 
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4.7 For a standard normal distribution as target, implement a Hastings- 
Metropolis algorithm with a mixture of five random walks with variances a = 
0.01,0.1,1,10,100 and equal weights. Compare its output with the output of 
Figure 4.2 (in the book). 

We thus compare the R code provided in the book 

tim=f unction (n, xO, sigma2) { 
x=rep(x0,n) 
for (i in 2;n){ 

y=rnorm(l,x[i-l],sqrt(sigma2)) 

if (runif(l)<=exp(-0.5*(y"2-x[i-l]"2))) x[i]=y 

else x[i]=x[i-l] 

} 

X 

> 

with a mixture version 


mhm=function(n,xO){ 
x=rep(x0,n) 

sigmas=c(0.01,0.1,1,10,100) 
for (i in 2;n){ 

y=rnorm(1,X[i-1],sqrt(samiple(sigmas,1))) 
if (runif(1)<=exp(-0.5*(y~2-x[i-1]"2))) x[i]=y 
else x[i]=x[i-l] 

} 


The outcome from the mixture version in Figure 4.4 in this manual is quite 
an improvement when compared with Figure 4.2 from the book. 


4.8 For the probit model under flat prior, find conditions on the observed pairs 
(x*,?/i) for the posterior distribution above to be proper. 


This distribution is proper (i.e. well-defined) if the integral 


J= f3)y' [l-<?(x*V)]^ d/3 


is hnite. If we introduce the latent variable behind <?(x*^/3), we get by Fubini 
that 


3 


2 = 1 


{/3 , 2=1,...,n} 


d/3 dzi - ■■ dzn , 
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Fig. 4.4. Outcome of a Metropolis-Hastings simulation of a M^(0,1) target using 
a mixture of random walk proposals: (Top:) Sequence of 10,000 iterations; (mid¬ 
dle:) Histogram of sample compared with the target density; (bottom:) Empirical 
autocorrelations using R function acf. 


where ^ Zi means that the inequality is x®^/3 < if = 1 and x*^/3 < z, 
otherwise. Therefore, the inner integral is finite if and only if the set 

fP = {^ ; x*"^/? ^ Zi, z = 1 ,..., n} 

is compact. The fact that the whole integral 3 is hnite follows from the fact 
that the volume of the polyhedron defined by fp grows like |zi when Zi goes to 
infinity. This is however a rather less than explicit constraint on the (x®, yi)’s! 

4.9 For the probit model under non-informative prior, find conditions on 
and ~ Vi) posterior distribution defined by (4.4) to be proper. 

There is little difference with Exercise 14.81 because the additional term 
{X X)l3) is only creating a problem when /3 goes to 0. This dif- 

hculty is however superhcial since the power in is small enough 
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to be controlled by the power in ||X/3||^“^ in an appropriate polar change of 
variables. Nonetheless, this is the main reason why we need a 7 r((T^) oc 
prior rather than the traditional 7r(a^) oc which is not controlled in /? = 0 . 
(This is the limiting case, in the sense that the posterior is well-defined for 
7 r(cr^) oc 0 ’“^+'^ for all e > 0 .) 

4.10 Include an intercept in the probit analysis of bank and run the correspond¬ 
ing version of Algorithm 4.7 to discuss whether or not the posterior variance of 
the intercept is high. 


We simply need to add a column of I’s to the matrix X, as for instance in 

> X=as.matrix(cbind(rep(1,dim(X)[1]),X)) 

and then use the code provided in the function hmf latprobit, i.e. 

flatprobit=hmflatprobit(10000,y,X,1) 
par(mfrow=c(5,3),mar=l+c(1.5,1.5,1.5,1.5)) 
for (i in 1:5){ 

plot(flatprobit[,i],type="l",xlab="Iterations", 
ylab=expression(beta[i])) 

hist(flatprobit[1001:10000,i],nclass=50,prob=T,main="", 
xlab=expression(beta[i])) 
acf(flatprobit[1001:10000,i],lag=1000,main="", 
ylab="Autocorrelation",ci=F) 

} 

which produces the analysis of bank with an intercept factor. Figure [T5] in 
this manual gives the equivalent to Figure 4.4 [in the book]. The intercept /3o 
has a posterior variance equal to 7558.3, but this must be put in perspective 
in that the covariates of bank are taking their values in the magnitude of 100 
for the three first covariates and of 10 for the last covariate. The covariance 
of XiiPi is therefore of order 7000 as well. A noticeable difference with Figure 
4.4 [in the book] is that, with the inclusion of the intercept, the range of /3i’s 
supported by the posterior is now negative. 


4.11 Using the latent variable representation of the probit model, introduce 
Zi\l3 ~ jV (x®^/3, l) (1 < i < n) such that = Iz;<o- Deduce that 




^ (x*'r/3,1,0) if yi = l, 
^ 1,0) if yi = 0. 


where o/Lf (/r, 1 , 0 ) and JV- (/r, 1 , 0 ) are the normal distributions with mean /i and 
variance 1 that are left-truncated and right-truncated at 0, respectively. Check 
that those distributions can be simulated using the R commands 
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Fig. 4.5. bank: estimation of the probit coefficients [including one intercept /3o] via 
Algorithm 4.2 and a fiat prior. Left: /3i’s (i = 0,... ,4); center: histogram over the 
last 9,000 iterations; right: auto-correlation over the last 9,000 iterations. 


> xp=qnorm(runif (1) *pnorin(mu)+pnorin(-inu) )+mu 

> xm=qnorm(runif (1) *pnorin(-inu) )+mu 

Under the flat prior 7r(/3) oc 1, show that 

/3|y,z~./rfc((XTx)-ixTz,(XTx)-i) , 

where z = (zi ,..., z„), and derive the corresponding Gibbs sampler, sometimes 
called the Albert-Chib sampler. [Hint A good starting point is the maximum 
likelihood estimate of /3.) Compare the application to bank with the output in 
Figure 4.4 in this manual. {Note: Account for differences in computing time.) 








































































44 


4 Generalized Linear Models 


If ZilP ~ ^ l) is a latent [unobserved] variable, it can be related 

to Ui via the function 

TJi — lzi<0 ) 

since P{yi = 1) = P{zi > 0) = 1 — ^ (—x*^/3) = ‘P (x*^/3). The conditional 
distribution of Zi given yi is then a constrained normal distribution: if yi = 1, 
Zi <0 and therefore 


Zi\y^ = l,/3 ~ (x*'^/3,1,0) . 

(The symmetric case is obvious.) 

The command qnormCrunif (l)*pnorm(mu)+pnorm(-mu))+mu is a simple 
application of the inverse cdf transform principle given, e.g., in Robert and 
Casella (2004): the cdf of the P/+ {y, 1, 0) distribution is 

_ <l>ix -y)- <P{-y) 


(An alternative is to call the R library truncnorm.) If we condition on both 
z and y [the conjunction of which is defined as the “completed model”], the 
j/i’s get irrelevant and we are back to a linear regression model, for which the 
posterior distribution under a flat prior is given in Section 3.3.1 and is indeed 

This closed-form representation justifies the introduction of the latent vari¬ 
able z in the simulation process and leads to the Gibbs sampler that simulates 
P given z and z given /? and y as in 


Zi\y^,P 


^ (x*'^/?, 1,0) if j/i = 1 
.AG (x*"^/?, 1,0) if j/i = 0 


(4.2) 


where {y, 1,0) and MG {y, 1,0) are the normal distributions with mean y 
and variance 1 that are left-truncated and right-truncated at 0, respectively. 

A R code of this sampler is available as follows (based on a call to the R 
library truncnorm): 

gibbsprobit=function(niter,y,X){ 
p=dim(X)[2] 

beta=matrix(0.niter,p) 
z=rep(0,length(y)) 

mod=summary(glm(y~-1+X,f ainily=binomial(link="probit"))) 
beta[l,]=as.vector(mod$coefficient[,1] ) 

Sigma2=solve (t (X) °A* 7 X) 
for (i in 2:niter){ 
mean=X7,*7obeta[i-l,] 

z[y==l]=rtruncnorm(sum(y==l),a=0,b=Inf.mean[y==l] .sd=l) 
z [y==0]==rtruncnorm(sum(y==0).a=-Inf.b=0.meatn [y==0].sd=l) 
Mu=Sigma27o*7.t (X) 7.*7oZ 
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beta[i,]=rmvn(1,Mu,Sigma2) 

} 

beta 

> 


4.61 in this manual. Note 


4.5 in this manual. (This 


The output of this function is represented on Figure 
that the output is somehow smoother than on Figure 
does not mean that the Gibbs sampler is converging faster but rather than 
its component-wise modification of the Markov chain induces slow moves and 
smooth transitions.) 

When comparing the computing times, the increase due to the simulation 
of the Zi’s is not noticeable: for the bank dataset, using the above codes re¬ 
quire 27s and 26s over 10, 000 iterations for hmf latprobit and gibbsprobit. 
respectively. 


4.12 For the bank dataset and the probit model, compute the Bayes factor 
associated with the null hypothesis iJg ■ 1^2 = Ps = 0 - 


The Bayes factor is given by 

^-'-/^T((2fc-l)/4) 

01 7r-('=-2)/2r{(2fc-5)/4} 

[1 - g>(x^V)] d/3 

For its approximation, we can use simulation from a multivariate normal as 
suggested in the book or even better from a multivariate £7 : a direct adapta¬ 
tion from the code in hmnoinfprobit is 

noinfprobit=hmnoinfprobit(10000,y,X,1) 

1ibrary(mnormt) 

mkprob=apply(noinfprobit,2.mean) 
vkprob=var(noinfprobit) 
simk=rmvnorm(100000,mkprob,2*vkprob) 
usk=probitnoinfIpost(simk,y,X)- 

dmnormCsimk,mkprob,2*vkprob,log=TRUE) 

noinfprobitO=hmnoinfprobit(10000,y,X[,c(1,4)],1) 
mk0=apply(noinfprobitO,2,mean) 
vk0=var(noinfprobitO) 
simk0=rmvnorm(l00000,mk0,2*vk0) 
usk0=probitnoinfIpost(simkO,y,X[,c(l,4)])- 
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Fig. 4.6. bank: estimation of the probit coefficients [including one intercept po] by 
a Gibbs sampler 4.2 under a flat prior. Left: pfs {i = 0,... ,4); center: histogram 
over the last 9,000 iterations; right: anto-correlation over the last 9, 000 iterations. 


dmnormCsimkO,mkO,2*vk0,log=TRUE) 
bfOprobit=mean(exp(usk))/mean(exp(uskO)) 

(If a multivariate is used, the dmnorm function must be replaced with 
dt the density of the multivariate /7.) The value contained in bfOprobit is 
67.74, which is thus an approximation to [since we divide the approxi¬ 
mate marginal under the full model with the approximate marginal under the 
restricted model]. Therefore, Hq is quite unlikely to hold, even though, inde¬ 
pendently, the Bayes factors associated with the componentwise hypotheses 
Hq : /32 = 0 and Hq : = 0 support those hypotheses. 
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4.13 In the case of the logit model-i.e., when pi = expx®^/3/{l +expx®^/3} 
(1 < f < A:)-derive the prior distribution on /3 associated with the prior 4.6 on 

The only difference with Exercise |4.11| is in the use of a logistic density, 
hence both the Jacobian and the probabilities are modified: 

A exp({it:,g, - l}x*T^) exp(x*T^) 

Trifj] OC I I 7^ n cy 

{1 + exp(x*'''y3)} ' {1 + exp(x*'''/3)} 

exp I ^ 

^ \i=l _ 

k 

n {l + exp(x*T^)}'"’ 

i=l 



4.14 Examine whether or not the sufficient conditions for propriety of the pos¬ 


terior distribution found in Exercise 4.9 for the probit model are the same for the 
logit model. 


There is little difference with Exercise 4.8 because the only change is [again] 
in the use of a logistic density, which has asymptotics similar to the normal 
density. The problem at /3 = 0 is solved in the same manner. 


4.15 For the bank dataset and the logit model, compute the Bayes factor as¬ 
sociated with the null hypothesis Hq '■ /32 = Ps = 0 and compare its value with 


the value obtained for the probit model in Exercise 4.12 


This is very similar to Exercise |4.12| except that the parameters are now 
estimated for the logit model. The code is provided in bayess as 

# noninformative prior auid random walk HM sample 
noinflogit=hmnoinflogit(10000,y,X,1) 


# log-marginal under full model 
mklog=apply (no inf logit, 2, meain) 
vklog=var(noinflogit) 
simk=rmnorm(100000,mklog,2*vklog) 
usk=logitnoinfIpost(simk,y,X)- 

dmnorm(simk,mklog,2*vklog,log=TRUE) 


# noninformative prior auid random walk HM sample 
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# for restricted model 

noinflogit0=hmnoinflogit(10000,y,X[,c(l,4)] ,1) 

# log-marginal under restricted model 
mk0=apply(noinflogitO,2,mean) 
vk0=var(noinflogitO) 

simkO=rmnorm(100000,mkO,2*vkO) 
usk0=logitnoinfIpost(simkO,y,X[,c(l,4)])- 
dmnorm(simk0,mk0,2*vk0,log=TRUE) 

bf01ogit=mean(exp(usk))/mean(exp(usk0)) 

The value of bfOlogit is 127.2, which, as an approximation to argues 
rather strongly against the null hypothesis Hq. It thus leads to the same con¬ 
clusion as in the probit model of Exercise |4.12[ except that the numerical value 
is almost twice as large. Note that, once again, the Bayes factors associated 
with the componentwise hypotheses Hq : ^2 = 0 and Hq : /I 3 = 0 support 
those hypotheses. 

4.16 Given a contingency table with four categorical variables, determine the 
number of submodels to consider. 


Note that the numbers of classes for the different variables do not matter 
since, when building a non-saturated submodel, a variable is in or out. There 
are 

1 . 2^ single-factor models [including the zero-factor model]; 

2 . ( 2 ® — 1 ) two-factor models [since there are ( 2 ) = 6 ways of picking a pair 
of variables out of 4 and since the complete single-factor model is already 
treated]; 

3. (2"^ — 1) three-factor models. 

Thus, if we exclude the saturated model, there are 2® -|- 2^ — 2 = 94 different 
submodels. 


4.17 In the case of a 2 x 2 contingency table with fixed total count n = 
nil + ni 2 -1-7121 +7122, we denote by dn, 0 i 2 , ^ 21 , ^22 the corresponding proba¬ 
bilities. If the prior on those probabilities is a Dirichlet .^ 4 ( 1 / 2 ,..., 1 / 2 ), give the 
corresponding marginal distributions of a = On -|- 812 and (3 = On + 02 i- Deduce 
the associated Bayes factor if Hq is the hypothesis of independence between the 
factors and if the priors on the margin probabilities a and [3 are those derived 
above. 

A very handy representation of the Dirichlet Vu^di,... ,5k) distribution is 
that 
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when 

Therefore, if 

( 011,^12 


( 6 i ■ ■■ ,^k) 

Cl + ■ • ■ + ?fc) 




Ci ~ '^a{6i, 1), i = 


721 


a X (Ci1jCi 2,C21,C22) iidf^ n/ 11 


then 


and 


(dll + di 2 , 6*21 + O22) = 


(Cll + Cl2; C2I + C22) 
Cii + C12 + C21 + C22 


(Cii + C12), (C21 + C22) ~ 1 ) 


implies that a is a i^e(l, 1) random variable, that is, a uniform ^(01,) vari¬ 
able. The same applies to /3. (Note that a and /3 are dependent in this repre¬ 
sentation.) 

Since the likelihood under the full model is multinomial. 


e{9\T) = ( "" ^^ 2 '^ 2 T 

Vniini 2 n 2 i/ 

where T denotes the contingency table [or the dataset {nn,ni 2 ,n 2 i, 71 . 22 }], 
the [full model] marginal is 


m(T) 


n 

n il 1112 1121 

TT^ 


/ 


^riii-1/2 
'7ll 


^" 12 -V 2 
'7i2 


^" 21-72 

'721 


'722 


dd 


\nii ni2 n2i/ ij _ 

P r(n + 2 ) 

\nii ni 2 n 2 i/ i,j _ 

TT^ (n-l- 1 )! 

1 -pr rijlij + 1/2) 
in+l)TT^ r{nij + 1 ) 


where the term comes from ^( 1 / 2 ) = 

In the restricted model, dn is replaced with afi, 0i2 by a(l — (i), and so 
on. Therefore, the likelihood under the restricted model is the product 


n 

ni. 


Ill - /'I \n—n-i. 

a ^ [1 — a) ^ X 


n 

n.i 
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where ni. = nn +ni2 and n.i = nn +n2i, and the restricted marginal under 
uniform priors on both a and (3 is 


mo{T) 


/ 77 \ / T} \ 

] a”i (lda / /3” Hl-/3)"-”'^d/3 

\ni.J \n.ij Jq Jq 

f ^ \ f \ (’^ i - + ~ ^ 1 - + !)• (’^•1 + ~ ^1 + !)• 

\ni./ \n.i/ (n + 2)! (n + 2)! 

(ni- + l)(n — ni- + 1 ) (n.i + l)(n — n.i + 1) 

(n + 2)(n + l) (n + 2)(n + l) 


The Bayes factor is then the ratio mo{T)/m(T). 



5 


Capture—Recapture Experiments 


5.1 Show that the posterior distribution 7 r(7V|n+) given by (5.1), while asso¬ 
ciated with an improper prior, is defined for all values of n’*'. Show that the 
normalization factor of (5.1) is n’*' V 1, and deduce that the posterior median is 
equal to 2{n'^ V 1) — 1. Discuss the relevance of this estimator and show that it 
corresponds to a Bayes estimate of p equal to 1 / 2 . 

Since the main term of the series is equivalent to the series converges. 

The posterior distribution can thus be normalised. Moreover, 


E 

i=no 


1 

i{i + 1) 



1111 
no no + I no -h 1 no -h 2 
1 

no ' 


Therefore, the normalisation factor is available in closed form and is equal to 
n+ V 1. The posterior median is the value N* such that tt{N > A^*|n+) = 1/2, 

i.e. 

00 

= 1/2 l/n+Vl = I/W* , 

i=N* 

which implies that N* = 2{n'^ V 1). This estimator is rather intuitive in 
that E[n+|iV,p] = pN: since the expectation of p is 1/2, E[n+|A^] = ^/2 and 
N* = 2n''' is a moment estimator of N. 


5.2 Under the same prior as in Section 5.2.1, derive the marginal posterior 
density of N in the case where nf ~ 3§{N,p) and 
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are observed (the later are in fact recaptures). Apply to the sample 


{nj 


+ ^ + 


, /£,2 7 • • • 5 '<'11 


) = ( 32 , 20 , 8 , 5 , 1 , 2 , 0 , 2 , 1 , 1 , 0 ). 


which describes a series of tag recoveries over 11 years. 


In that case, if we denote + ■ ■ ■+n'^ the total number of captures, 

the marginal posterior density of N is 

m 


Tr{N\n+,...,n^) oc "" N -'Ijv>„+ 


{N-n+y. 


t + ---+n+ _ p-^N-n+ + (ni+-n+ + -+n+ -n+ 


dp 


oc 




{N-n+y -wo 
(A^-l)! {N + kn+-n+y. 


dp 


I, 


(AT-n)^)! {N + kn+ + iy ’ 

which does not simplify any further. Note that the binomial coefficients 


n+ 


(J > 2) 


are irrelevant for the posterior of N since they only depend on the data. 
The R code corresponding to this model is as follows: 

nl=32 

ndo=sum(32,20,8,5,1,2,0,2,1,1,0) 


# unnormalised posterior 
post=function(N){ 

expClfactorial(N-l)+lfactorial(N+ll+nl-ndo)- 
Ifactorial(N-nl)-Ifactorial(N+ll*nl+l)) 

} 

# normalising constant auid 

# posterior mean 

posv=post((nl:10000)) 
cons=sum(posv) 

pmecLn=sum( (nl: 10000) *posv) /cons 
pmedi=sum(cumsum(posv)<.5*cons) 
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The posterior mean is therefore equal to 282.4, while the posterior median is 
243. Note that a crude analysis estimating^ by p = {n^ + ■ ■ .+nii)/(10n]^) = 
0.125 and N by rii jp would produce the value N = 256. 

5.3 Show that the conditional distribution of m 2 conditional on both sample 
sizes ni and 712 is given by (5.2) and does not depend on p. Deduce the expectation 

E’"[m2|ni,n2,iV]. 


Since 


mii§{N,p), m2|ni ~ ^(ni,p) 


and 


712 - m2|77i, m 2 ~ ^{N - ni,p ), 
the conditional distribution of m2 is given by 

/(m2|71i, 712 ) OC f - p)N-n,-n,+m, 

— \ 772 - TO2 / 


OC 


OC 


\m2 

/ 771 
V?T72 
Til 

m2 

/ Til 

W2 


N-m 

772 — 1712 
N — Til 
772 — 1712 
N — Til 
772 - 1712 


m2+n2—m2 


(1 -p) 


ni—m2-\-N—ni—n2-\-m2 


which is the hypergeometric Jif{N, 112 , 711 /N) distribution. Obviously, this 
distribution does not depend on p and its expectation is 


E[m2|77l,7l2] = 


77 1 772 

N 


5.4 In order to determine the number N of buses in a town, a capture-recapture 
strategy goes as follows. We observe 771 = 20 buses during the first day and keep 
track of their identifying numbers. Then we repeat the experiment the following 
day by recording the number of buses that have already been spotted on the 
previous day, say m2 = 5, out of the 712 = 30 buses observed the second day. For 
the Darroch model, give the posterior expectation of N under the prior 7r(iV) = 
1/N. 


Using the derivations of the book, we have that 

7r(iV|r7i, 772, 1712 ) OC ^ + l,2N - rf + l)Iiv>„+ 


(iV-1)! (2fV-77^)! 


(IV- 77 +)! (2iV + l)! 


N>n+ 


OC 
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with n~^ = 45 and n‘^ = 50. For n~^ = 45 and n° = 50, the posterior mean is 
equal to 130.91. 

5.5 Show that the maximum likelihood estimator of N for the Darroch model 
is iV = ni/ {m 2 /n 2 ), and deduce that it is not defined when m 2 = 0. 


The likelihood for the Darroch model is proportional to 

{N-n,)\ {N-n+)\ 

= TIC - im -T7i- ^N>n+ 


Since 


for 


{N-n2)l Nl 
£{N + 1) {N+ l-ni)iN+l-n2) 


e{N) 


(iV + l-n+)(iV + l) 


> 1 


(TV + 1)2 - (TV + l)(ni + na) + nma > (A^ + 1)^ - {N + l)n+ 
(TV + l)(ni + 712 — n'^) > nin 2 


(TV+1) < 


711772 

m 2 


the likelihood is increasing for TV < nin2/m2 and decreasing for TV > 
771772/7772. Thus TV = 771772/7772 is the maximum likelihood estimator [assum¬ 
ing this quantity is an integer]. If m 2 = 0, the likelihood is increasing with TV 
and therefore there is no maximum likelihood estimator. 


5.6 Give the likelihood of the extension of Darroch’s model when the capture- 
recapture experiments are repeated K times with capture sizes and recapture 
observations rik (1 < fc < K) and 777 ^ {2 < k < K), respectively. {Hint: Exhibit 
first the two-dimensional sufficient statistic associated with this model.) 


The likelihood for the Darroch model is proportional to 


Since 


for 


e{N) 


(TV-77i)! (TV-77+)! 

(TV- 772 )! TVi 


lAr>n+ ■ 


i{N+l) (TV-bl-77l)(TV-bl- 772) ^ 
^(TV) “ (TV-bl-77+)(TV-bl) - 


(TV-f 1)2 


(TV -f l)(77i -h 772 ) -b 77 1 772 > (TV -|- 1)2 
(TV -I- l)(77l -I- 772 - 77+) > 77l772 
/AT ^ nin2 


{N + 1)77+ 
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the likelihood is increasing for N < nin2/m2 and decreasing for N > 
nxn^jfTil. Thus iV = nin2/Tn2 is the maximum likelihood estimator [assum¬ 
ing this quantity is an integer]. If m2 = 0, the likelihood is increasing with N 
and therefore there is no maximum likelihood estimator. 


5.7 Give both conditional posterior distributions involved in Algorithm 5.8 in 
the case = 0. 

When = 0, there is no capture at all during both capture episodes. 
The likelihood is thus (1 — and, under the prior tt{N,p) = 1/N, the 
conditional posterior distributions of p and N are 


p\N,n+ = 0~^e(l,2Af-hl), 


N\p, = 0 ‘ 


(1 -P) 
N 


2N 


That the joint distribution 7r(A^, p|n+ = 0) exists is ensured by the fact that 
Tr{N\n~^ = 0) oc 1/A^(2A^ -|- 1), associated with a converging series. 


5.8 Show that, for the two-stage capture model with probability p of capture, 
when the prior on is a ^{X) distribution, the conditional posterior on N — n'^ 
is — p)^). 


The posterior distribution of {N,p) associated with the informative prior 
Tr(N,p) = X^e~^/N\ is proportional to 


Nl 


{N-n+)lNl 


(i-p) 


2N-n‘‘ 


■N>n+ • 


The corresponding conditional on N is thus proportional to 


\N \N-n+ 

(N-n+)l (jv-n+)! 

which corresponds to a Poisson ^{X{1 — p)^) distribution on — n+. 

5.9 Reproduce the analysis of eurodip summarized by Figure 5.1 when switching 
the prior from Tr{N,p) oc X^/N\ to tt{N,p) oc A^“^. 


The main purpose of this exercise is to modify the code provided in the 
book (p.l51) and in the demo for Chapter 5, since the marginal posterior 
distribution of N is given in the book as 
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Tr{N\n'^ 


n‘^) cx 


(A^-l)! (TN-n^^y. 
{N -n+)\ “(TiV + 1)1' 


lAr>n+Vl • 


(The conditional posterior distribution of p does not change.) This distribution 
being non-standard, it makes direct simulation awkward and we prefer to use 
a Metropolis-Hastings step, using a modified version of the previous Poisson 
conditional as proposal q{N'\N,p). We thus simulate 


N* -n+ r-. ^ 


and accept this value with probability 

7 r(iV*|n+,n‘=) 

7r(iVb“i) |n+, n'^) 

The corresponding modified R function is 

gibbs1l=function(nsimu,T,nplus,nc) 

{ 

# conditional posterior 
rati=function(N){ 

Ifactorial(N-l)+lfactorial(T*N-nc)- 
Ifactorial(N-nplus)-Ifactorial(T*N+1) 

} 


N=rep(0,nsimu) 
p=rep(0,nsimu) 

N [1]=2*nplus 

p [1]=rbeta(l,nc+l,T*N[1]-nc+1) 
for (i in 2:nsimu){ 


# MH step on N 
N[i]=N[i-l] 

prop=nplus+rpois(1,N[i-1]*(1-p[i-1])~T) 
if (logCrunif(1))<rati(prop)-rati(N[i])+ 

dpois(N[i-1]-nplus.prop*(1-p[i-1])”T,log=T)- 
dpois(prop-nplus,N[i-1]*(1-p[i-1])”T,log=T)) 
N [i]=prop 

p[i]=rbeta(l,nc+l,T*N[i]-nc+1) 

> 

list(N=N,p=p) 

} 

The output of this program is given in Figure [5T| 
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Fig. 5.1. eurodip: MCMC simulation under the prior tt{N,p) oc ^ 


5.10 An extension of the T-stage capture-recapture model of Section 5.2.3 
is to consider that the capture of an individual modifies its probability 
of being captured from p to q for future recaptures. Give the likelihood 

i{N,p, g|ni, 712, m2 mr). 


When extending the T-stage capture-recapture model with different prob¬ 
abilities of being captured and recaptured, after the hrst capture episode, 
where ?ii ~ ■S§{N,p), we observe T — 1 new captures (i = 2,..., T) 

Hi — mi|ni,712,m2, ■ • •, ti^-i, m^-i ~ SS{N — m — n2 + m 2 + ■ ■ ■ + mi-i,p) , 

and T — 1 recaptures (i = 2,..., T), 

mi|7ii,7i2,m2,... ,7ij_i,mi_i ~ .^(tii -|- ?i2 - m 2 -h ... - m^-i, g). 


The likelihood is therefore 

T 




i=2 


N — Til + — mi_i 

71,; - m. 






T 

n 

i=2 

N\ 


rii -f ?i2 — . •. — mi_i 
mi 




711 + ... —mi 


{N-n+)l 


p^* {1 - pf^-^‘ {1 - qY 


where ?i+ = 711 — m2 -!-••• — itit is the number of captured individuals. 
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T 

n* = Tm + — j + l)(«i — Wj) 

i=2 

and where to"'' = mi + - • ■+mT is the number of recaptures. The four statistics 
(ui,n"*',n*,771“'") are thus sufficient for this version of the T-stage capture- 
recapture model. 

5.11 Another extension of the 2-stage capture-recapture model is to allow for 
mark loss. If we introduce q as the probability of losing the mark, r as the proba¬ 
bility of recovering a lost mark and k as the number of recovered lost marks, give 
the associated likelihood i{N,p,q,r\ni,n 2 ,rn 2 , k). 


There is an extra-difficulty in this extension in that it contains a latent 
variable: let us denote by z the number of tagged individuals that have lost 
their mark. Then z ^ q) is not observed, while k ~ i^(z, r) is observed. 

Were we to observe (ni, 772, m2, A:, z), the [completed] likelihood would be 


r(iV,p,9,r|ni,n2,m2,fc,z)= q^il-q^' 


Hl-rY 

N — m + z 
772 — m2 


-k rii - z 


P^K^-p) 


Til —z —m 2 


m2 

"*2 _ p^A''-Tii-|-z-n2-|-m2 


since, for the second round, the population gets partitioned into individuals 
that keep their tag and are/are not recaptured, those that loose their tag 
and are/are not recaptured, and those that are captured for the first time. 
Obviously, it is not possible to distinguish between the last two categories. 
Since z is not known, the [observed] likelihood is obtained by summation over 
z: 


iV' 

e{N,p,q,r\ni,n2,Tn2,k) oc —- "(1 

[N - 77i)! 

ni—m2 

E 

z—kVN—ni—n2-\-'m2 

/«-„, +A ,.(1 ,1 

V 772 - m2 / 



p'j2N-ni-n2 


ry-\ 


Note that, while a proportionality sign is acceptable for the computation of the 
likelihood, the terms depending on z must be kept within the sum to obtain 
the correct expression for the distribution of the observations. A simplified 
version is thus 
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Nl 

i{N,p,q,r\ni,n 2 ,m 2 ,k) cx 

ni—m2 

E 

z—kVN—rii — n2+m2 

but there is no close-form solution for the summation over z. 


(N-n,)'. 

{N-ni + z)![g(l - r)/(l - q)Y 
z\{ni - z - m2)!(iV - ni - n2 -|- m2 -|- 2:)! 


5.12 Show that the conditional distribution of ri in the open population model 
of Section 5.3 is proportional to the product (5.4). 

The joint distribution of V* = (ni, C2, C3, ri, r2) is given in the book as 






pC2(l_p)ni-ri-c, 


Therefore, if we only keep the terms depending on ri, we indeed recover 
1 - (ni-ri)! 


ri!(ni - ri)! 

(ni - ri)! 


5-1(1 

(1-9) 


[ni - ri - C2)! 

(ni - ri - r2)! 


ni-ri-C2 


(ni - ri - r2)! 

(ni - ri)! 


ni—ri—r2 


(ni - ri - r 2 - C 3 )! 

9 


_ p^ni-ri-r2-C3 
ri 


OC 


ri!(ni - ri - C 2 )!(ni - ri - r 2 - C 3 )! ( (1 - q)^{l - pY 

ni -C2\ fni-ri\ f _ q _'' 

ri J \r2+cs) - qYil - pY 


under the constraint that ri < min(ni, ni —r2, ni — r2 — C3,711 — 02) = min(ni — 
7-2 - C 3 ,ni - C 2 ). 


5.13 Show that the distribution of 02 in the open population model of Section 
5.3 can be integrated out from the Joint distribution and that this leads to the 
following distribution on ri: 


7r(7’l|p, g,7ll,C2,C3) OC 


(ni - ri)!(ni - ri - C 3 )! 
ri!(ni - ri - C2)! 


« __ 

V(l-p)(l-9)[9 + (l-p)(l-9)] 


ri 


Compare the computational cost of a Gibbs sampler based on this approach with 
a Gibbs sampler using the full conditionals. 
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Following the decomposition of the likelihood in the previous exercise, the 
terms depending on r2 are 


1 


q 


(ni - ri - r2)! 

r2!(ni - ri - r2)! V(1 -p)(l - q) i (n-i - ri - r2 - C3)! 


1 


q 


r2!(ni - ri - r2 - C3)! - p){l - q) 

If we sum over 0 < r2 < ui — ri — C3, we get 


1 


ni—ri—C3 


(ni - ri - C3)! 


E 

+ 


ni-ri- C3 

k 


q 


(1 -p)(l - q) 

ni—ri—cs 


(l-p)(l-g) 
that we can agregate with the remaining terms in ri 


(n - n)! 


q 


to recover 

TT{ri\p,q,ni,C 2 ,C 3 ) ex 


ri!(ni - ri - C 2 )! i (1 - g)2(l -p)2 

(ni - ri)!(ni - ri - C3)! 


ri!(ni - ri - C2)! 


{1 - p){l - q)[q + {I - p){l - q)] 


5.14 Show that the likelihood associated with an open population as in Section 
5.3 can be written as 

T N 

iiN,pm= E nncp-u(i-9.u-p)^-"‘ 

t=li=l 

where go = 9. 9i = 1. and Su and eu are the capture and exit indicators, 
respectively. Derive the order of complexity of this likelihood; that is, the number 
of elementary operations necessary to compute it. 


This is an alternative representation of the model where each individual 
capture and life history is considered explicitely. This is also the approach 
adopted for the Arnason-Schwarz model of Section 5.5. We can thus define 
the history of individual 1 < i as a pair of sequences (ea) and {Sa)-, 

where eu = 1 at the exit time t and forever after. For the model given at 



















5 Capture-Recapture Experiments 


61 


the beginning of Section 5.3, there are ni equal to 1, ri e^i’s equal to 1, 
C2 ^i2’s equal to 1 among the i’s for which Sn = 1 and so on. If we do not 
account for these constraints, the likelihood is of order 0(3^"'") [there are three 
possible cases for the pair (e^t, Su) since Su = 0 if eu = Ij. Accounting for the 
constraints on the total number of equal to 1 increases the complexity of 
the computation. 

5.15 In connection with the presentation of the accept-reject algorithm in Sec¬ 
tion 5.4, show that, for M > 0, \f g \s replaced with Mg in and if {X,U) is 
uniformly distributed on the marginal distribution of X is still g. Deduce that 
the density g only needs to be known up to a normalizing constant. 


The set 

y = {(a;,u) : 0 < u < Mg{x)} 

has a surface equal to M. Therefore, the uniform distribution on has density 
1/M and the marginal of X is given by 

/■_ 1 , Mg{x) 

J A0,Mg(a;)) ~ “ 9^ ' 

This implies that uniform simulation in provides an output from g no mat¬ 
ter what the constant M is. In other words, g does not need to be normalised. 

5.16 For the function (;(x) = (l-|-sin^(x))(2-|-cos^(4a:)) exp[—a;^{l-|-sin®(a;)}] 
on [0,27r], examine the feasibility of running a uniform sampler on the set 
associated with the accept-reject algorithm in Section 5.4. 

The function g is non-standard but it is bounded [from above] by the 
function g{x) = 6exp[—a;"^] since both cos and sin are bounded by 1 or even 
g{x) = 6. Simulating uniformly over the set associated with g can thus be 
achieved by simulating uniformly over the set associated with g until the 
output falls within the set associated with g. This is the basis of accept- 
reject algorithms. 

5.17 Show that the probability of acceptance in Step 2 of Algorithm 5.9 is 
1/M and that the number of trials until a variable is accepted has a geometric 
distribution with parameter 1/M. Conclude that the expected number of trials 
per simulation is M. 

The probability that U < g{X)/{Mf{X)) is the probability that a uniform 
draw in the set 

= {{x,u) : 0 < u < Mg{x)} 
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falls into the subset 


.9’q = {{x,u) : 0 < u < f{x)}. 


The surfaces of and being M and 1, respectively, the probability to fall 
into is 1/M. 

Since steps 1. and 2. of Algorithm 5.2 are repeated independently, each 
round has a probability 1/M of success and the rounds are repeated till the 
first success. The number of rounds is therefore a geometric random variable 
with parameter 1/M and expectation M. 

5.18 For the conditional distribution of at derived from (5.3), construct an 
accept-reject algorithm based on a normal bounding density / and study its 
performances for N = 532, rit = 118, \it = —0.5, and = 3. 


That the target is only known up to a constant is not a problem, as demon¬ 


strated in Exercise 5.20 To find a bound on 7r(at|iV,Ut) [up to a constant], 
we just have to notice that 


(1-Fe“‘)”^ < e-^“* 


and therefore 


(1-Fe“‘) ^ exp jatUt - ^(at - 


< exp \ at{nt - N) - - Htf 

= exp ^ + 2^(/rt - a\N - m)) - 


1 




TTCr 


exp \ -^(«t - A** + cr^(A^ - nt))" 


v^crexp - cr^{N - nt)]^)| . 


The upper bound thus involves a normal — cr^(iV — nt),a'^) distribution 

and the corresponding constant. The R code associated with this decomposi¬ 
tion is 


# constants 
N=53 
nt=38 
mut=-.5 
sig2=3 

sig=sqrt(sig2) 
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# log target 
ta=functioii(x){ 

-N*log(1+exp(x))+x*nt-(x-mut)“2/(2*sig2) 

} 


#bounding constant 
bmecLn=mut-sig2* (N-nt) 

uc=0.5*log(2*pi*sig2)+(bmean"2-mut~2)/(2*sig2) 
prop=rnorm (1, sd=s ig) tbineaii 

ratio=ta(prop) -uc-dnorm (prop, meEin=bmecLn, sd=sig, log=T) 
while (logCrunif(l))>ratio){ 


prop=rnorm(1,sd=sig)+bmean 

ratio=ta(prop)-uc-dnorm(prop,mean=bmean,sd=sig,log=T) 

> 


The performances of this algorithm degenerate very rapidly when N — rit is 
[even moderately] large. 


5.19 When uniform simulation on the accept-reject set of Section 5.4 is 
impossible, construct a Gibbs sampler based on the conditional distributions of u 
and X. {Hint: Show that both conditionals are uniform distributions.) This special 
case of the Gibbs sampler is called the slice sampler (see Robert and Gasella, 


2004, Chapter 8). Apply to the distribution of Exercise 5.16 


Since the joint distribution of {X, U) has the constant density 


t(x, U) llo<u<(;(3:) : 


the conditional distribution of U given X = a: is ^{0,g{x)) and the condi¬ 
tional distribution of X given U = u is {{x; g{x) > u}), which is uniform 
over the set of highest values of g. Both conditionals are therefore uniform 
and this special Gibbs sampler is called the slice sampler. In some settings, 
inverting the condition g{x) > u may prove formidable! 

If we take the case of Exercise 5.16 and of g{x) = exp(—x^), the set 
{x; g{x) > u\ is equal to 


{x;5(x) > u} = |x;x < (-log(x))^/'^| 


which thus produces a closed-form solution. 
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5.20 Show that the normalizing constant M of a target density / can be deduced 
from the acceptance rate in the accept-reject algorithm (Algorithm 5.9 under the 
assumption that g is properly normalized. 

This exercise generalises Exercise |5.17| where the target / is already nor¬ 
malised. 

If f{x) = M f{x) is a density to be simulated by Algorithm 5.9 and if g is 
a density such that 

f{x) < Mg{x) 

on the support of the density g, then running Algorithm 5.9 with an ac¬ 
ceptance probability of g[x)/Mf{x) produces simulations from / since the 
accepted values have the marginal density proportional to 

^ (X f(x) . 

In that case, the average probability of acceptance is 



lx MM 


dx = 


1 

MM ■ 


Since the value of M is known, the average acceptance rate over simulations, 
g, leads to estimate M as 



gM 


5.21 Reproduce the analysis of Exercise 5.20 for the marginal distribution of ri 
computed in Exercise 5.13| 


The only change in the codes provided in demo/Chapter. 5. R deals with 
thresh, called by ardipper, and with gibbs2 where the simulation of r 2 is 
no longer required. 

5.22 Modify the function ardipper used in Section 5.4 to return the acceptance 
rate as well as a sample from the target distribution. 


As provided in Section 5.4, the function ardipper is defined by 
ardipper=function(nsimu=l,nl,c2 ,c3,r2,q2){ 

barr=min(nl-c2,nl-r2-c3) 
boundM=thresh(0,nl,c2,c3,r2,barr) 
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echaii=l :nsimu 
for (i in l;nsimu){ 
test=TRUE 
while (test){ 

y=rbinom(1,size=barr,prob=q2) 
test=(runif(1)>thresh(y,nl,c2,c3,r2,barr)) 
} 

echan[i]=y 

} 


echan 


} 

The requested modification consists in monitoring the acceptance rate and 
returning a list with both items: 

ardippest=function(nsimu=l,nl,c2,c3,r2,q2){ 

barr=min(nl-c2,nl-r2-c3) 
boundM=thresh(0,nl,c2,c3,r2,barr) 
echan=l:nsimu 
acerate=-nsimu 
for (i in l:nsimu){ 
test=TRUE 
while (test){ 

y=rbinom(1,size=barr,prob=q2) 
test=(runif(1)>thresh(y,nl,c2,c3,r2,barr)) 
acerate=acerate+l 
} 

echan[i]=y 

} 

list(scmiple=echan,rej ect=acerate/nsimu) 


} 


5.23 Show that, given a mean and a 95% confidence interval in [0,1], there 
exists at most one beta distribution ^e{a,b) with such a mean and confidence 
interval. 

If 0 < m < 1 is the mean m = a/{a + b) of a beta i^e(a, b) distribution, 
then this distribution is necessarily a beta £§e{am,a{l — m)) distribution, 
with a > 0. For a given confidence interval [£, u], with 0<.^<to<m<1, we 
have that 



[since, when a goes to zero, the mass of the beta 3§e{am, a(l—m)) distribution 
gets more and more concentrated around 0 and 1, with masses (1 — m) and 




66 


5 Capture-Recapture Experiments 


m, respectively] and 


lim 


r{a) 




a-^-oo Jg r{am)r{a{l — m)) 

[this is easily established using the gamma representation introduced in Ex¬ 


ercise 


4.17 and the law of large numbers]. Therefore, due to the continuity [in 


a\ of the coverage probability, there must exist one value of a such that 




i) = 


na) 


If r{am)r(a{l — m) 


-i(l_a;)“(i-™)-ida: = 0.9. 


Figure 5.2 illustrates this property by plotting B{i,u\a,m) for i = 0.1, u = 
0.6, m = 0.4 and a varying from 0.1 to 50. 



Fig. 5.2. Coverage of the interval {£, u) = (0.1,0.6) by a .^e(0.4a, 0.6 q:) distribution 
when a varies. 


5.24 Show that, for the Arnason-Schwarz model, groups of consecutive un¬ 
known locations are independent of one another, conditional on the observations. 
Devise a way to simulate these groups by blocks rather than one at a time; that is, 
using the joint posterior distributions of the groups rather than the full conditional 
distributions of the states. 


As will become clearer in Chapter 7, the Arnason-Schwarz model is a very 
special case of [partly] hidden Markov chain: the locations of an indi¬ 
vidual i along time constitute a Markov chain that is only observed at times 
t when the individual is captured. Whether or not is observed has no 
relevance on the fact that, given ■Z(z,t-2)) ■ • •) is independent 

from (z(j Z(i,t+2), • ■ •)■ Therefore, conditioning on any time t and on the 
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corresponding value of Z(i^t) niakes the past and the future locations indepen¬ 
dent. In particular, conditioning on the observed locations makes the blocks 
of unobserved locations in-between independent. 

Those blocks could therefore be generated independently and parallely, 
an alternative which would then speed up the Gibbs sampler compared with 
the implementation in Algorithm 5.3. In addition, this would bring additional 
freedom in the choice of the proposals for the simulation of the different blocks 
and thus could further increase efficiency. 
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Mixture Models 


6.1 Show that a mixture of Bernoulli distributions is again a Bernoulli distribu¬ 
tion. Extend this to the case of multinomial distributions. 


By definition, if 


X ~ ^Pr-^{qi) , 




then X only takes the values 0 and 1 with probabilities 

k k k 

- Qi) = 1-'^PiQi and y^^p^qi 

i—1 i—1 i—1 

respectively. This mixture is thus a Bernoulli distribution 

^ ) ■ 


When considering a mixture of multinomial distributions, 

k 


2 = 1 


with qi = (^ii,..., x takes the values 1 < j < k with probabilities 


k 

i=l 

and therefore this defines a multinomial distribution. This means that a mix¬ 
ture of multinomial distributions cannot be identifiable unless some restric¬ 
tions are set upon its parameters. 
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6.2 Show that the number of nonnegative integer solutions of the decomposition 
of n into k parts such that ni + ... + Uk is equal to 

fn + k — l\ 

'=( n j- 

Deduce that the number of partition sets is of order {Hint: This is a 

classical combinatoric problem.) 


This is a usual combinatoric result, detailed for instance in Feller (1970). 
A way to show that r is the solution is to use the “bottomless box” trick: 
consider a box with k cases and n identical balls to put into those cases. If we 
remove the bottom of the box, one allocation of the n balls is represented by 
a sequence of balls (O) and of case separations (|) or, equivalently, of O’s and 
I’s, of which there are n and k — 1 respectively [since the box itself does not 
count, we have to remove the extreme separations]. Picking n positions out of 
n + (fc — 1) is exactly t. 

This value is thus the number of “partitions” of an n sample into k groups 
[we write “partitions” and not partitions because, strictly speaking, all sets 
of a partition are non-empty]. Since 

n + k — 1\ {n + k — 1)1 


n\{k-l)\ (fc-1)!’ 

when k, there is indeed an order 0(n^“^) of partitions. 


6.3 For a mixture of two normal distributions with all parameters unknown, 

(MI, CTi) + (1 - P)^ip2, CT 2 ) , 
and for the prior distribution (j = 1,2) 

~ cr|/nj), cr| ~ J^{vj/2, ^^2) , p ~ ^e{a, /3), 


show that 


pjx, z ~ S^e{a -F 4, /3 -F 4), 


/rj|crj,x,z ~ .yK ( ^i(z), —] , cr|lx, z ~ -F £j)/2, Sj(z)/2), 


nj -F 


where is the number of Zi equal to j, Xj{’L) and Sj(z) are the empirical mean 
and variance for the subsample with Zi equal to j, and 


= 


nj^j+£jXj(z) 

rij + (.j 


: Sj (z) = s ■ + £j (z) -F (Cj -Xj{z))'^ 


rijij 


rij -F Cj 


Compute the corresponding weight uj(z). 
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If the latent (or missing) variable z is introduced, the joint distribution of 
(x, z) [equal to the completed likelihood] decomposes into 


Y[PzJixi\9^,) = Yl n 


i-1 




o^yip/ n 

i=l i;zi=j 


( 6 . 1 ) 


where pi = p and p 2 = (1 — p). Therefore, using the conjugate priors proposed 
in the question, we have a decomposition of the posterior distribution of the 
parameters given (x, z) in 


/i+«- 1(1_P)^2+/3 


-1 


n n 

J = 1 l\Zi=j 


o f^j) 


-7r(pj,crf) . 


This implies that pjx, z ~ .^^e(a + £i,/3 + I 2 ) and that the posterior distri¬ 
butions of the pairs are the posterior distributions associated with 

the normal observations allocated (via the Zi^s) to the corresponding compo¬ 
nent. The values of the hyperparameters are therefore those already found in 
Chapter 2 (see, e.g., Exercises 2.7 and 2.15). 

The weight u;(z) is the marginal [posterior] distribution of z, since 

7r(0,p|x) = ^w(z)7r(0,p|x,z) . 


Therefore, if pi = p and p 2 = 1 — p, 




UJ 


»«/ Y[p/ n 

C(a -I- l'i)T'(/3 + £ 2 ) 
T(q!-I-^-I- n) 

2 


7r(0,p) d0dp 


]Jexp 

1=1 


-1 


^ WiPj - 0(z))^ + si(z)} 


— ij —tPj —3 


de 


T(a + £i)r(/3 + £2) T((£,- + ^,-)/2)(s,-(z)/2)(-^-+^^-)/^ 


r{a +13 + ri) 


1=1 


V^i+Tj 


and the proportionality factor can be derived by summing up the rhs over all 
z’s. (There are 2" terms in this sum.) 


6.4 For the normal mixture model of Exercise 6.3 compute the function Q{0q, 6) 
and derive both steps of the EM algorithm. Apply this algorithm to a simulated 
dataset and test the influence of the starting point Oq. 
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Starting from the representation ( 6 . 1 ) above, 


log£(0,p|x,z) ='^{li{zi)log{p fixi\ 0 i) +hizi)log{il - p) f{x^\ 92 )} 
2=1 


which implies that 

E(e<*),p(t)) [log£(0,p|x,z)|x] 

n 

= X! = l|x)log(p/(x,|0i) 

2 = 1 

+P(e(‘).p(‘)) [zi = 2|x) log((l - p) /(a;,|6>2)| 

n 

= log(pM)^P(e(0,p(t)) [Zi = l|x) 

2 = 1 

71 

+ log((l - p)/(J2) P(e(t),p(0) {zi = 2|x) 

2=1 

- ^P(e<‘>.pW) = 1|^) 

i=i 1 

-Wp (z -2lx')t^2— 

/ . \Z^t ~ ^ 1 ^/ 20.2 


If we maximise this function in p, we get that 
1 ” 

^ 2 = 1 

^ 1 _ 

^ i=i + (1 -p(‘))/(x*|e^‘^) 

while maximising in {pj,aj) (j = 1,2) leads to 


n / '^ 

= XI ^(e'‘bp<*b / X^(0<‘bp(‘>) 

2 = 1 ' 2=1 

Xipf j{xi\e^^) 


E 


n / ^ 

2{t+\) ^ (a;. _^(‘+l))2 / ^P(g(t)^p(0) (z* = j| 

2=1 ' 2=1 


E 


a;, - M} 


(t+i) 
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where pf ^ and = (1 — 

A possible implementation of this algorithm in R is given below: 

# simulation of the dataset 
n=324 

tz=sample(l:2,n,prob=c(.4,.6),rep=T) 
tt=c(0,3.5) 
ts=sqrt(c(1.1,0.8)) 
x=rnorm(n,mean=tt[tz],sd=ts[tz]) 

para=matrix(0,ncol=50,nrow=5) 
likem=rep(0,50) 

# initial values chosen at random 

para[,1]=c(runif(1),mecLn(x)+2*rnorm(2)*sd(x),rexp(2)*var(x)) 
likem[1]=sum(log( para[l,1]*dnorm(x,mean=para[2,1], 

sd=sqrt(para[4,1]))+(1-para[1,1])*dnorm(x,mean=para[3,1], 
sd=sqrt(para[5,1])) )) 

# 50 EM steps 
for (em in 2:50){ 

# E step 

postprob=l/( l+(l-para[1,em-1])*dnorm(x,mean=para[3,em-1] , 
sd=sqrt(para[5,em-1]))/( para[1,em-l]*dnorm(x, 
mean=para[2,em-1],sd=sqrt(para[4,em-1]))) ) 

# M step 

para[1,em]=mean(postprob) 

para[2,em]=mean(x*postprob)/para[1,em] 

para[3,em]=mean(x*(1-postprob))/(1-para[1,em]) 

para[4,em]=mean((x-para[2,em])~2*postprob)/para[1,em] 

para[5,em]=mean((x-para[3,em])~2*(1-postprob))/(l-para[l,em]) 

# value of the likelihood 

likem[em]=sum(log(para[1,em]*dnorm(x,mecLn=para[2,em], 

sd=sqrt(para[4,em]))+(l-para[l,em])*dnorm(x,mecLn=para[3,em], 
sd=sqrt(para[5,em])) )) 

} 

Figure |6.1| in this manual in this manual represents the increase in the 
log-likelihoods along EM iterations for 20 different starting points [and the 
same dataset a:] . While most starting points lead to the same value of the log- 
likelihood after 50 iterations, one starting point induces a different convergence 
behaviour. 
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iterations 

Fig. 6.1. Increase of the log-likelihood along EM iterations for 20 different starting 
points. 


6.5 In the mixture model with independent priors on the 0j's, show that the 
9j’s are dependent on each other given (only) x by summing out the z's. 

The likelihood associated with model (6.2) being 


i{e,p\x) 


n 


k 

E 




it is clear that the posterior distribution will not factorise as a product of 
functions of the different parameters. It is only given (x, z) that the 0j’s are 
independent. 

6.6 Construct and test the Gibbs sampler associated with the (■C,Mo) parame¬ 
terization of (6.3), when = pQ — ^ and p 2 = fJ-o + C 

The simulation of the z^’s is unchanged [since it does not depend on the 
parameterisation of the components. The conditional distribution of i^,p-o) 
given (x, z) is 

7r(C, Aio|x, z) oc exp ^ ^ {xi - Mo + + E 

[zi^l Zi^2 



Therefore, ^ and independent given (x,z), with 
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Aio|?,x,z - ^ 
CIMo,x,z - ^ 


nx 


+ (^ 1 -^ 2 )^ 1 


n n 

J2zi=2i^^ - Mo) - - Mo) 1 

n ’ n 


The implementation of this Gibbs sampler is therefore a simple modifica¬ 
tion of gibbsmean in the bayess: the MCMC loop is now 

for (t in 2:Nsim){ 


# allocation 

fact=.3*sqrt(exp(gul"2-gu2"2))/.7 
probs=l/(1+fact*exp(scmipl*(gu2-gul))) 
zeds=(runif(N)<probs) 

# Gibbs sampling 

muO=rnorm(l)/sqrt(N)+(sum(sampl)+xi*(sum(zeds==l) 
-sum(zeds==0)))/N 

xi=rnorm(l)/sqrt(N)+(sum(sainpl[zeds==0]-muO) 
-sumCsampl[zeds==l]-muO))/N 

# reparameterisation 
gul=muO-xi 
gu2=mu0+xi 

muz[t,]=(c(gul,gu2)) 


} 

If we run repeatedly this algorithm, the Markov chain produced is highly 
dependent on the starting value and remains captive of local modes, as illus¬ 
trated on Figure |6?2] in this manual. This reparameterisation thus seems less 
robust than the original parameterisation. 

6.7 Show that, if an exchangeable prior tt is used on the vector of weights 
{pi,... ,Pk), then, necessarily, = 1/k and, if the prior on the other pa¬ 

rameters ( 01 ,..., 0fc) is also exchangeable, then 'E'^[pj\xi, ..., a;„] = 1/k for all 

j's- 


If 


7r(pi, . ..,Pk)= T^{Pa(l),- ■ ■ ,Pa(k)) 


for any permutation a G &k, then 



.,Pj,...,Pfc)dp 


J dp = E’"[pi]. 
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Fig. 6.2. Influence of the starting value on the convergence of the Gibbs sampler 
associated with the location parameterisation of the mean mixture (10,000 itera¬ 
tions). 
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Given that this implies = l/k. 

When both the likelihood and the prior are exchangeable in {pj ,9j), the 
same result applies to the posterior distribution. 

6.8 Show that running an MCMC algorithm with target 7 r( 0 |x)''' will increase 
the proximity to the MAP estimate when 7 > 1 is large. {Note: This is a crude 
version of the simulated annealing algorithm. See also Chapter]^) Discuss the 
modifications required in Algorithm 6.11 to achieve simulation from 7 r( 0 |x)''' when 
7 G N* is an integer. 

The power distribution TT.y{0) oc Tr{0)'^ shares the same modes as it, but the 
global mode gets more and more mass as 7 increases. If 9* is the global mode of 
TT [and of TT^], then {tt{9)/'k{9*)}'^ goes to 0 as 7 goes to 00 for all 9's different 
from 9*. Moreover, for any 0 < a < 1, if we define the a neighbourhood Tla 
of 9* as the set of 0’s such that 7 r( 0 ) > a7r(9*), then 7 r.^( 9 lQ,) converges to 1 
as 7 goes to 00 . 

The idea behind simulated annealing is that, first, the distribution TT-y{9) oc 
7 r( 0 )''' is more concentrated around its main mode than 7 r( 0 ) if 7 is large and, 
second, that it is not necessary to simulate a whole sample from 7 r( 0 ), then a 
whole sample from 7 r( 0 )^ and so on to achieve a convergent approximation of 
the MAP estimate. Increasing 7 slowly enough along iterations leads to the 
same result with a much smaller computing requirement. 

When considering the application of this idea to a mean mixture as (6.3) 
[in the book], the modification of Algorithm 6.2 is rather immediate: since 
we need to simulate from 7 r( 0 ,p|x)'>' [up to a normalising constant], this is 
equivalent to simulate from £{6,p\x)'^ x 7 r( 0 ,p)^. This means that, since the 
prior is [normal] conjugate, the prior hyperparameter A is modified into 7 A 
and that the likelihood is to be completed 7 times rather than once, i.e. 

i{e,p\xy = fj f{x,z\e,p)dz\ =Y[J /(x,zj| 0 ,p)dzj ■ 

Using this duplication trick, the annealed version of Algorithm 6.2 writes as 


Algorithm 6.1 Annealed Mean Mixture Gibbs Sampler 

Initialization. Choose and 
Iteration t {t > 1). 

1. For j = 1,..., n, j = 1,..., 7 , generate from 

¥{zij = 1) (xp exp I 

¥{zij = 2) oc (1 -p) exp [x^ - | 
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2. Compute 


j—li—l 3—^ i—1 


3. Generate from 


4. Generate ^ 2 ^ from jY 


jy 


1 


\ 7 A + £ 7 A + £ 
7A(5 + (z) 1 


7 A + 7 n — £ 7 A + 771 — £ 


This additional level of completion means that the Markov chain will have 
difficulties to move around, compared with the original Gibbs sampling algo¬ 
rithm. While closer visits to the global mode are guaranteed in theory, they 
may require many more simulations in practice. 

6.9 Show that the ratio (6.7) goes to 1 when a goes to 0 when the proposal q 
is a random walk. Describe the average behavior of this ratio in the case of an 
independent proposal. 


Since 

M 1-9) ^ 9{l-9) ’ 

the Metropolis-Hastings acceptance ratio for the logit transformed random 
walk is 

6.10 If one needs to use importance sampling weights, show that the simul¬ 
taneous choice of several powers a requires the computation of the normalizing 
constant of tTq,. 

If samples from several tempered versions tTq, of tt are to be 

used simultaneously, the importance weights associated with those samples 
'x{9ia)/'Xoi{0ict) require the computation of the normalizing constants, which 
is most often impossible. This difficulty explains the appeal of the “pumping 
mechanism” of Algorithm 6.5, which cancels the need for normalizing con¬ 
stants by using the same tTq, twice, once in the numerator and once in the 
denominator. 
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6.11 In the setting of the mean mixture (6.3), run an MCMC simulation exper¬ 
iment to compare the influence of a ,7^(0,100) and of a ^( 0 , 10000 ) prior on 
(/tiif', 2 ) on a sample of 500 observations. 

The power distribution TT-y{9) oc 7 r( 0 )''' shares the same modes as tt, but the 
global mode gets more and more mass as 7 increases. If 9* is the global mode of 
TT [and of TT^], then {tt{9)/'k{9*)}'^ goes to 0 as 7 goes to 00 for all 9's different 
from 9* . Moreover, for any 0 < a < 1, if we define the a neighbourhood OIq 
of 9* as the set of 9’s such that 7r(d) > O7r(0*), then converges to 1 

as 7 goes to 00 . 

The idea behind simulated annealing is that, first, the distribution Tr-y{9) oc 
tt{9)'^ is more concentrated around its main mode than tt{9) if 7 is large and, 
second, that it is not necessary to simulate a whole sample from tt{9), then a 
whole sample from tt{9)^ and so on to achieve a convergent approximation of 
the MAP estimate. Increasing 7 slowly enough along iterations leads to the 
same result with a much smaller computing requirement. 

When considering the application of this idea to a mean mixture as (6.3) 
[in the book], the modification of Algorithm 6.2 is rather immediate: since 
we need to simulate from 7r(0,p|x)'>' [up to a normalising constant], this is 
equivalent to simulate from £(0,p|x)'>' x 7 r( 0 ,p)^. This means that, since the 
prior is [normal] conjugate, the prior hyperparameter A is modified into 7 A 
and that the likelihood is to be completed 7 times rather than once, i.e. 



Using this duplication trick, the annealed version of Algorithm 6.2 writes as 


Algorithm 6.2 Annealed Mean Mixture Gibbs Sampler 


Initialization. Choose and 
Iteration t {t> 1). 

1. For f = 1,..., n, j = 1,..., 7 , generate from 



2. Compute 


7 n 


7 n 



and 
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3. Generate from jV 

4. Generate from jV 


1 


7 A(5 + barxi (z) 

'yX + £ ’ jX + £ 

jXS + X 2 (z) 1 


7 A + 7 n — £ ’ 7 A + 771 — £y 


This additional level of completion means that the Markov chain will have 
difficulties to move around, compared with the original Gibbs sampling algo¬ 
rithm. While closer visits to the global mode are guaranteed in theory, they 
may require many more simulations in practice. 

6.12 Show that, for a normal mixture 0.5M^(0,1)-1-0.5 the likelihood 

is unbounded. Exhibit this feature by plotting the likelihood of a simulated sample 
using the R image procedure. 


This follows from the decomposition of the likelihood 

n 2 

£(e|x) = n Y.^.5f{x,\9,) , 

i=i j=i 

into a sum [over all partitions] of the terms 

n/(-.!«.,)= n »■(-<) n 

2=1 i;zi — l i;zi—2 

In exactly n of those 2” partitions, a single observation is allocated to the 
second component, i.e. there is a single i such that Zi = 2. For those particular 
partitions, if we choose ^ = Xi, the second product reduces to 1/cr which is 
not bounded when cr goes to 0. Since the observed likelihood is the sume of 
all those terms, it is bounded from below by terms that are unbounded and 
therefore it is unbounded. 

An R code illustrating this behaviour is 

# Scunple construction 
N=100 

sampl=rnorm(N)+(runif(N)<.3)*2.7 

# Grid 

mu=seq(-2.5,5.5,length=250) 

sig=rev(l/seq(.001,.01,length=250)) # inverse varicince 

mol=mu7,*7ot (repd , length=length(sig) ) ) 
mo2=(rep(l, length=length(mu) ) )7.*7ot (sig) 
cal=-0.5*mol~2*mo2 











ca2=mol*mo2 
ca3=sqrt(mo2) 
ca4=0.5*(l-mo2) 
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# Likelihood surface 

like=0*mol 

for (i in 1:N) 

like=like+log(l+exp(cal+Scmipl[i]*ca2+sampl[i]"2*ca4)*ca3) 
like=like-min(like) 

sig=rev(l/sig) 

image(mu,sig,like,xlab=expression(mu), 

ylab=expression(sigma"2),col=heat.colors(250)) 
contour(mu,sig,like,add=T,nlevels=50) 


and Figure [673| in this manual exhibits the characteristic stripes of an explosive 
likelihood as a approaches 0 for values of p, close to the values of the sample. 



a 


Fig. 6.3. Illustration of an unbounded mixture likelihood. 
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Dynamic Models 


7.1 Consider the process (xt)tgz defined by 

Xt = a+ bt + yt, 

where is an iid sequence of random variables with mean 0 and variance 

(T^, and where a and b are constants. Define 

wt = (2g + . 

Compute the mean and the autocovariance function of Show that 

is not stationary but that its autocovariance function + h,t) does 
not depend on t. 


We have 


E[wt] = E 


(2g + l)-i g 


Xt+j 


— (2(? + 1) ^ J]] E [a + b{t + j) + yt\ 


= a+ bt. 


The process {wt)t& is therefore not stationary. Moreover 
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E[wtwt+h] = E 


CL -\-bt-\- 


2q 


-y la + bt + bh+Y^ Vt+h+j 


3^-q 




= (a + bt){a -\-bt -\- bh) + E 


Vt+h+j 


3^-Q 3^-Q 


— (q- + bt)[a bt bh) H- II|/i|<g(^ + 1 — . 


Then, 

cov{wt, Wt+h) = +\<qiq + 1 - \h\)+ 

and, 

7u,(t + h,t) = l\h\<q{q + 1 - \h\)+ . 

7.2 Suppose that the process {xt)tGN is such that xq ~ .y4^(0,T^) and, for all 
t € N, 

Xt+i\xo,t ^ ■y^{xt/‘2,+), a>0. 

Give a necessary condition on for {xt)t^-M to be a (strictly) stationary process. 


We have 

E[a;i] = E[E[xi|a:o]] = E[a:o/2] = 0. 


Moreover, 


V(a;i) = V(E[xi|xo]) + E[V(xi|xo)] = + + + + . 

Marginaly, xi is then distributed as a M^(0, T^/4 + cr^) variable, with the same 
distribution as xq only if t^/4 + + = +, i.e. if + = 4(t^/3. 

7.3 Suppose that (xt)tgN is a Gaussian random walk on M: xq ~ and, 

for all t G N, 

Xt+i\xo-t ^ ^{xt,cr'^) , cr>0. 

Show that, whatever the value of is, {xt)t^N is not a (strictly) stationary 
process. 


We have 

E[a;i] = E[E[xi|xo]] = E[xo] = 0 . 

Moreover, 

V(a;i) = V(E[a;i |a;o]) + E[V( 2 ;i | xq)] = + + + . 

The marginal distribution of xi is then a 71^(0, distribution which 

cannot be equal to a c/K(0, r^) distribution. 
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7.4 Give the necessary and sufficient condition under which an AR(2) process 
with autoregressive polynomial V{u) = 1 — QxU — (with Q 2 yf 0) is causal. 


We have 

E[a;i] = E[E[a::i|a::o]] = E[xo/2] = 0. 


Moreover, 


V(a;i) = V(E[xi|xo]) + E[V(xi|xo)] = t^/4 + . 

Marginaly, Xi is then distributed as a c/K(0, t^/4 + ct^) variable, with the same 
distribution as xq only if = r^, i.e. if ji. 

7.5 Consider the process (xt)tgN such that xq = 0 and, for all t G N, 

Xt+i|xo:t ~ JV{QXt.,a^) . 

Suppose that 7r(£), tr) = l/cr and that there is no constraint on g. Show that the 
conditional posterior distribution of g, conditional on the observations Xqit and 
on cr^, is a jy{g,T,i^T) distribution with 



Show that the marginal posterior distribution of g is a Student 3'{T— 
distribution with 

4 = • 

\t=i ' t=o / 

Apply this modeling to the Aegon series in EurostoxxSO and evaluate its predic¬ 
tive abilities. 


The posterior conditional density of g is proportional to 

T 

J|exp{-(xt - gxt-if/2a‘^] 


oc exp 


T-l T-l 

X! 4 + 2£| ^ XtXt+i 




I, L t=0 J ) 

which indeed leads to a conditional distribution as indicated 

above. 
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Given that the joint posterior density of {g, a) is proportional to 

T 

I]^exp {-{xt - QXt-if/2a'^} 


integrating out a leads to a density proportional to 

J exp “ pxt-i?/ do- 

= y (o-^)”^^^”^exp {^^^{xt - pxt-if /{2a‘^^ da^ 


-T/2 


= s - gxt-if 


when taking into account the Jacobian. We thus get a Student ^7{T — 
distribution and the parameters can be derived from expanding 
the sum of squares: 


into 


T-l 


^{xt - QXt-if = ^x1 {g^ - 2gfXT) + 


Xt 


i=0 


T-l 


T-l 


Y^Uq- Prf +Y^t - Y 


2 2 
Xt IIt 


t=0 




{q - PT? , 1 I Et=i 


T- 1 

{g- Prf 
T- 1 


T-l 


y 


^1 a^t 2 ^ 

t=0 xt J 


The main point with this example is that, when g is unconstrained, the 
joint posterior distribution of (p, cr) is completely closed-form. Therefore, the 
predictive distribution of xt+i is given by 


1 

yyit 


exp{-(a;T+i - gxxY 12^} T:{a, £»|xo:T)dCTd£» 


which has again a closed-form expression: 
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/ J- exp{ - {xT+i - QXrf 7r(a, p|xo:T)dtTd£< 

J V27rcr 

f _ _ ^ 

oc / cr~^~^ exp{— y^(xt+i — gcct)^/2o'^}do'dg 

+_r» 


i=0 

s '^{xt+i - gxt)- 

[ t=0 

T \ -(T+1)/2 

2 1 / 


-(T+1)/2 


dp 


E 


cx I > 

\t=o 

/ T \ -(T+l)/2 

T T f T 

E^?E^E - sE 


(p-/XT+l)^ , ,,2 

1“ ’^T+l 


-(T+2)I2 


dp 


2\ (T+l)/2 


This is a Student T{T,6t,(^t) distribution, with 


T-l T-l 

St = xt'^ xtxt+il ^ xf = ptXt 

t=0 t=0 


and 


ojt 


= i E*‘E^E 


t=0 


t=0 



X 


2 

t ■ 


The predictive abilities of the model are thus in providing a point estimate for 
the next observation xt+i = PtXt, and a confidence band around this value. 


7.6 For Algorithm 7.13, show that, if the proposal on is a log-normal distribu¬ 
tion .ifo/K(log(cr^_]^), T^) and if the prior distribution on is the noninformative 
prior 7r((T^) = l/cr^, the acceptance ratio also reduces to the likelihood ratio 
because of the Jacobian. 

If we write the Metropolis-Hastings ratio for a current value Uq and a 
proposed value (t\, we get 


7r(g?K(cri) exp (-(log(crg - log(g^))V2 t^) /gg ^ £(a?) 

exp (-(log(crg - log(cr?))2/2T2) /al l{al) ’ 


as indicated. 








7 Dynamic Models 


7.7 Write down the joint distribution of in (7.19) and deduce that 

the (observed) likelihood is not available in closed form. 

Recall that yo ^ A/’(0, cr^) and, for f = 1,..., T, 


yt = tpyt-i + cre*_i, 

xt = (3ey*^'^et, 


where both Cj and are iid Af{0, 1) random variables. The joint distribution 
of (xi. 7 ’,yo:T) is therefore 

/ (Xl:T,yO:T) = / (xIitIYOit) / (YOit) 



Due to the double exponential term exp ( —x^ exp{—yt)j, it is im¬ 


possible to find a closed-form of the integral in yo:T. 

7.8 Show that the stationary distribution of x_p._i in an AR(p) model is a 
=ylj,(/rlp, A) distribution, and give a fixed point equation satisfied by the covari¬ 
ance matrix A. 

If we denote 


— (,Xt^ Xt — l^ ... 5 —p) ; 


then 


Zt-i-i — /rip + B {zt — /rip) -|- €t+i ■ 


Therefore, 


E [zt+i|zt] = flip + B{zt- flip) 


and 


cr2 0 ... 0 

0 0 ... 0 


V {zt+i\zt) = V (ct+i) 


= V. 


0 0 ... 0 
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Then, 


zt+i\zt - A/'p {filp + B{zt- fj,lp ), V) . 

Therefore, if z_i = x_p._i ^ JVp (flip, A) is Gaussian, then Zj is Gaussian. 
Suppose that z* ^ Np{M, A), we get 

E [zt+i) = /rip + B{M - /rip] 

and E [zt+i] = E [zt] if 

/rip + B (M -/rip) = M, 

which means that M = /rip. Similarly, V (zj+i) = V (zj) if and only if 

BAB' + V = A, 

which is the “fixed point” equation satisfied by A. 

7.9 Show that the posterior distribution on 6 associated with the prior tt{9) = 
l/cr^ and an AR(p) model is well-defined for T > p observations. 

The likelihood conditional on the initial values Xo:(p_i) is proportional to 



A traditional noninformative prior is 7r(/r, pi,..., pp, a^) = Xj. In that case, 
the probability density of the posterior distribution is proportional to 



And 



t—p I \ 2=1 / 

holds for r — p -(-1 > 0, i.e., T > p — 1. This integral is equal to 

(p_r-i)/2 



which is integrable in /r for T — p > 0, i.e. T > p. The other parameters gj 
(j = 1,... ,p0 being bounded, the remaining integrand is clearly integrable in 

Q- 
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7.10 Show that the coefficients of the polynomial V in (7.15) associated with 
an AR(p) model can be derived in O(p^) time from the inverse roots using the 
recurrence relations {i = 1,... ,p, j = 0 , ... ,p) 

V>S = 1 , , 

where V'o = 1 ^nd -i/;] = 0 for j > i, and setting gj = (j = 1 ,... ,p). 

Since 

p j 

J|(l - Xix) = 1 - X! ’ 

i=i j=i 

we can expand the Ihs one root at a time. If we set 

i i 

11(1 - Ajx) = , 

i=i j=o 

then 

7+1 i 

]J (1 - Xjx) = (1 - A,+ia:) J|(l - Xjx) 
i=i i=i 

i 

= (1 - \i+ix)y^^ip]x^ 

3=0 

i 

= 1 + - Xi+i'(p]_i)x^ - A*++Ja;*+^, 

i=i 

which establishes the = -ipj — Xi+itpj_i recurrence relation. 

This recursive process requires the allocation of i variables at the fth stage; 
the coefficients of V can thus be derived with a complexity of O(p^). 

7.11 Given the polynomial V in (7.5), the fact that all the roots are outside the 
unit circle can be determined without deriving the roots, thanks to the Schur- 
Cohn test. If + = 7^, a recursive definition of decreasing degree polynomials is 
(fc =p,..., 1) 

uAk-iiu) = Ak-iiu) - (fkAKu) , 
where Alp denotes the reciprocal polynomial Ap{u) = u^Ak-i[^/u). 

1. Give the expression of ipk in terms of the coefficients of +. 

2. Show that the degree of + is at most k. 

3. If am,k denotes the m-th degree coefficient in +., show that ak,k ^ 0 for 
A: = 0,... if, and only if, ao^k 7 ^ o-k^k for all k’s. 
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4. Check by simulation that, in cases when ak,k 7 ^ 0 for fc = 0,... ,p, the roots 
are outside the unit circle if, and only if, all the coefficients ak,k are positive. 


Note: The above exercise is somewhat of a mystery (!) in that we cannot 
remember how it ended up in this exercise list, being incorrect and incomplete 
as stated. A proper substitute is given below: 


7.11 Given a polynomial V of degree fc, its reciprocal polynomial is defined 
as 

r*{u) = u’^Vk-i{l/u). 

Assuming 7^(0) = 1, the Schur transform of V is defined by 

Viz)-V*{0)V*iz) 


TV{u) = 


1 -iP*(0)2 


1. Show that the roots of V and are inverses. 

2. Show that the degree of TV is at most k — 1. 

3. Show that TV{0) = 1. 

4. Check by a simulation experiment producing random polynomials the 
property that, when TV{0) > 1, TV and TV have the same number of 
roots inside the unit circle. 

5. Denote T^V = T(T^~^V), for d ^ k, and k the first index with V^V = 0. 
Deduce from the above property that, if T'^V > 0 for n = 1,..., k, then 
V has no root inside the unit circle. 


1. If we write the inverse root decomposition of V as 

k 

'Piu) = n(l - Ku) , 

i^l 

since 7^(0) = 1, we have 

k k k 

V*{u) = ]^(1 - Ku-^) = - A,) = J|(l - . 

2=1 2=1 2=1 

2. By definition, if V[u) = then 

k 

'P*{u) ='^ak-iu ’-, 

i=0 

V*{Q) = ak, and 
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fe-l 


V{u) - V*{Q)V*{u) = akU^ + aiU^ - 

k-1 


k-1 

Otk ^ ^ Clk—i^ 
i=l 


is at most of degree fc — 1. 

3. Since 

V{Q)-V*{Q)V*{0) = l-al, 

T'P(O) = 1. 

4. A simulation experiment can be designed around the following code: 

k=10 

# random coefficients 

Coef=c(l,runif(k,-l,l)) 

Schur=Coef-Coef[k] *rev(Coef ) 

print(sum(Mod(polyroot(Coef))<1)-sum(Mod(polyroot(Schur))<1)) 

Repeating this code a large number of times does not produce anything 
but zero’s. 

5. By virtue of the above result, V^TV,. ■. ,T'^~^V have the same number 
of roots inside the unit circle if T"7^ >0forn = l,...,K—1. Since 

= 1 - = 1 - 

the last root is outside the unit disk and hence so are the others. 

6. Extending the above code leads to 

k=10 

# Schur sequence 

Coef=matrix(0,nrow=k+l,ncol=k+l) 

# initial polynomial 

Coef [,k+l]=c(l,rnorm(k,sd=l/k)) 
for (t in k:1) 

Coef [1:t,t]=(Coef[1:(t+1),t+1]-Coef[t+1,t+1]*Coef[(t+1):1, 
t+l])/(l-Coef[t+1,t+1] ~ 2 ) 
while (prod(diag(Coef[1,]"2)<1)==0){ 

Coef=matrix(0,nrow=k+l,ncol=k+l) 

Coef[,k+l]=c(l,rnorm(k,sd=l/k)) 
for (t in k:1) 

Coef [1:t,t] = (Coef [1:(t+1),t+l]-Coef[t+1,t+l]*Coef[(t+1):1, 
t+1])/(l-Coef[t+1,t+1]~2) 

} 

print(min(Mod(polyroot(Coef[,k+l] )))) 

Repeated calls to this code consistently exhibit root modules larger than 

1 . 
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7.12 For an MA(( 7 ) process, show that (s<g) 

9-|s| 

'y2:(s) = ^ 'y ^ • 

i=0 


We have 


7^(s) = E[xta;t-s] 

= E [[ct + + . . . + 'dqet-q] [ct-s + + . . . + 'dq^t-s-q^ ■ 


Then, if 1 < s < g, 


lx{s) = + -i^s+l'i^l + . • . + 'dq'dq-s] 


and 

7.(0)= [l+t?2 + ...+^2] ^2^ 

Therefore, if (0 < s < g) with the convention that = 1 


7 .(s) = a 


q-s 

2=0 


2 + S ' 


The fact that 7 .(s) = 7 .(—s) concludes the proof. 

7.13 Show that the conditional distribution of (cq, ..., C-g+i) given both Xi-g- 
and the parameters is a normal distribution. Evaluate the complexity of computing 
the mean and covariance matrix of this distribution. 

The distribution of Xi:^ conditional on (cq, •. •, e-g+i) is proportional to 


o- ^nexpj 



Take 


(cq, . • . , C—g+l) ~ A/'g (Og, d Iq^ . 


In that case, the conditional distribution of (cq, ..., e_g+i) given Xi:t is pro¬ 
portional to 

n exp{-e2/2cr2} ]Jexp{-e2/2cr2} . 
i——q+l t—1 


Due to the recursive definition of Ct, the computation of the mean and the 
covariance matrix of this distribution is too costly to be available for realistic 
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values of T. For instance, getting the conditional mean of Ci requires deriving 
the coefficients of from all terms 



2 


by exploiting the recursive relation 

9 

et = xt- + 'djet-j ■ 
i=i 


If we write ei = di + PiCi and €t = /StCi, then we need to use the recursive 
formula 

9 9 

— Xt 4“ 'y ^ ; Pt — y ^ Pt—j ; 

i=i i=i 

before constructing the conditional mean of e^. The corresponding cost for this 
single step is therefore 0{Tq) and therefore O(gT^) for the whole series of Ci’s. 
Similar arguments can be used for computing the conditional variances. 

7.14 Give the conditional distribution of e_t given the other e_i's, and 
the Ci’s. Show that this distribution only depends on the other e_i's, Xi.q_t_|_i, 
and ei-q-t+i- 

The distribution of Xi-t’ conditional on (cq, ..., e_q+i) is proportional to 


Take 



(cq, ■ • ■ , £—g+l) ~ A/'q (Oqj ^ Iq) ■ 


In that case, the conditional distribution of (eg, ■ ■ ■, e-q+i) given xi:t is pro¬ 
portional to 

n exp{-e^/2cr^} ]Jexp{-e2/2CT2} . 

Z —— q+1 t — 1 


Due to the recursive definition of it, the computation of the mean and the 
covariance matrix of this distribution is too costly to be available for realistic 
values of T. For instance, getting the conditional mean of requires deriving 
the coefficients of from all terms 



7 Dynamic Models 


95 



by exploiting the recursive relation 


9 

= cct - /r + ^ ■ 

i=i 

If we write ei = + jSiei and 'et = St + /StCi, then we need to use the recursive 

formula 

9 9 

St — ^ ^ 5 l^t — ^ ^ (St—j 5 

i=i i=i 

before constructing the conditional mean of e^. The corresponding cost for this 
single step is therefore 0{Tq) and therefore O(gT^) for the whole series of e^’s. 
Similar arguments can be used for computing the conditional variances. 

7.15 Show that the (useful) predictive horizon for the MA(g) model is restricted 
to the first q future observations Xt+i- 

Obviously, due to the lack of correlation between xt+q+j (j > 0) and Xi:t 
we have 

E [xT+q+ilxnr] = E [xr+q+i] = 0 

and therefore the MA{q) model has no predictive ability further than horizon 

9- 

7.16 Show that the system of equations given by (7.13) and (7.14) induces a 
Markov chain on the completed variable (xt,yt)- Deduce that state-space models 
are special cases of hidden Markov models. 


Given the time-dependence structure 


xt = Gyt + et , 

yt+i = Pyt + $,t : 


we can write 



Since the noises ^t and Et are independent, the full vector (x 4 ,yt_|_i) is indeed 
a Markov chain. The subchain (yt) is also a Markov chain on itd own. And 
observing only xt means that we are observing a hidden Markov chain, in the 
sense of Figure 7.7 in the book. 
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7.17 Show that, for a hidden Markov model, when the support y is finite and 
when {yt)t^-M is stationary, the marginal distribution of xt is the same mixture 
distribution for all t's. Deduce that the same identifiability problem as in mixture 
models occurs in this setting. 


Since the marginal distribution of Xt is given by 

/ fi.xt\yt)'x{yt) dyt = ^ T^{y)f{xt\y ), 

v^y 

where tt is the stationary distribution of {yt), this is indeed a mixture distri¬ 
bution. Although this is not the fundamental reason for the unidentifiability 
of hidden Markov models, there exists an issue of label switching similar to 
the case of standard mixtures. 


7.18 Given a hidden Markov chain {xt,yt) with both xt and yt taking a finite 
number of possible values, k and k, show that the time required for the simulation 
of T consecutive observations is in 0(A:kT). 

Note: The order indicated in the exercise should be O(k^T), for the dis¬ 
tribution conditional on the observed Xt^s. 

For direct simulation, given the hidden chain at time t, yt, simulating yt+i 
requires up to k comparisons with a uniform variate. Given yt+i, simulating 
Xt+i involves another maximum of k comparisons with a uniform variate. 
Repeating those steps T times leads to a 0({fc -I- k}T) time. 

For inverse simulation, that is, after observing (cci,..., Xt), the joint con¬ 
ditional distribution of (i/i ,... ,yT) is given by 

p{yi, ■ ■ ■,yT\xi, ...,xt)(x po{yi)piy2\yi) ■ ■ ■p{yT\yT-i)p{xi\yi) ■ --pixTlyT), 

which takes values. 

However, if we use the backward formula described in the book, we 
could gain some time. If we get back to the defintion of the backward for¬ 
mula, the distribution of yx given the past being only conditional on yr-i, 
p(yr|yT_i, xqit), takes values. Then, for each previous hidden state, yt, 
p( 2 /t|?/t_i,X qiJ’) involves a summation of k terms for all pairs {yt-i,yt)- But 
the summation 

K 

^Pt+l(f|yt,Xl:T) 

i=l 

only depends on yt, thus has to be computed k times, to be later multiplied 
by Pyt-tyt- Therefore the cost of producing p(yt|2/(_i,X q-t) is again of order 
. At last, p(2/o|xo:t) requires k summations of n terms, thus is again of order 
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K^. This confirms that the overall cost is in 0{k^T) and that the number of 
possible values of the Xt’s is irrelevant. 

7.19 Implement Chib’s method of Section 6.8 in the case of a doubly finite 
hidden Markov chain. First, show that an equivalent to the approximation (6.9) 
is available for the denominator of (6.8). Second, discuss whether or not the label 
switching issue also rises in this framework. Third, apply this approximation to 

Dnadataset. 

In a hidden Markov model Vt being the hidden part, when the 

parameters are unknown, it is usually the case that the full posterior distri¬ 
bution of the parameter 7r(p,q|x, y) is available in closed form. In particular, 
as shown in Algorithm 7.15, this full posterior distribution is a product of 
K Beta distributions on the pi. ’s and of k Dirichlet distributions on the qi. ’s 
(* = 1 , 2 ). 

As alluded to in the book, it is also a setting where label switching occurs. 
Indeed, the introduction of states 1 and 2 in the hidden chain does not identify 
which state is which. The posteriors on and q^ should therefore be the same. 
Since the Gibbs sampler does not produce such symmetry on Figure 7.9, it is 
quite likely that Chib’s approximation will be biased in this setting. 

The implementation for Dnadataset of the Chib involves picking the 
highest likelihood value for 6 = (q^,q^,P) and averaging the full conditionals 
of 6 given the hidden chain over the Gibbs iterations. 

7.20 Show that the counterpart of the prediction filter in the Markov-switching 
case is given by 

t r K 

l0gp(xi:t) = ^log '^f{Xr\Xr-l,yr=i)Pr{'i) , 
where (prii) = ^{Vr = is given by the recursive formula 

K 

(Pr{i) (x'^PjJ{Xr-l\Xr-2,yr-l = j)(pr-l{j) ■ 

i=i 


This exercise is more or less obvious given the developments provided in 
the book. The distribution of yr given the past values Xi:j.-i is the marginal 
of (?/r,*/r-i) given the past values xi-^-i: 
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V{yr = ^ P(?/r = i, Ur-l = 

i=i 

K 

= = j|xi:r-l)P(j/r = *|j/r-l = j) 

i=i 

K 

= j,Xr-l\xi:r-2) 
i=i 

K 

= j,\y^l-.r- 2 )f{Xr-l\Xr- 2 ,yr-l = j) , 
i=i 

which leads to the update formula for the (prii)’- The marginal distribution 
xi:i is then derived by 


t 

^(xiii) = J|p(a;r|xi:(r_i)) 
r—1 
t K 

= = j,Xr\y^l:r-l) 

r=l j=l 
t K 

= ^ f{Xr\Xr-l,yr = i)Pr{i) , 

r=l j=l 

with the obvious convention pi(i) = tt^, if (tti, ... jTt^) is the stationary dis¬ 
tribution associated with P = (pij). 
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Image Analysis 


8.1 Find two conditional distributions f{x\y) and g{y\x) such that there is no 
joint distribution corresponding to both / and g. Find a necessary condition for / 
and g to be compatible in that respect; i.e., to correspond to a joint distribution 

on {x,y). 

As stated, this is a rather obvious question: if f{x\y) = 4y exp(—4j/a:) and if 
g{y\x) = 6xeyip{—6xy), there cannot be a joint distribution inducing these 
two conditionals. What is more interesting is that, if f{x\y) = 4?/exp(—dya;) 
and g{y\x) = Ax exp(—Ayx), there still is no joint distribution, despite the 
formal agreement between both conditionals: the only joint that would work 
has the major drawback that it has an infinite mass! 

8.2 Using the Hammersley-Clifford theorem, show that the full conditional dis¬ 
tributions given by (8.3) are compatible with a joint distribution. Deduce that the 
Ising model is a Markov random field. 

Note: In order to expose the error made in the earlier printing of Bayesian 
Core, namely using the size of the symmetrized neighborhood, Nk{i), in the 
full conditoinal, we will compute here the potential joint distribution based 
on the pseudo-conditional 



even though it is defined for Nk{i) = 1 in the book. 

It follows from (8.4) that, if there exists a joint distribution, it satisfies 


n—1 



p(yix,/?,fc)« n 
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Therefore, 

P(y|X, 13, A:) cx exp J /3 ^ | ^ (yi) - ly^ {y *)] + 

[ i=i \e<i,er^ki 

is the candidate joint distribution. Unfortunately, if we now try to derive the 
conditional distribution of yj from this joint, we get 



= Qly-oX,/?, fc) oc exp/3 


1 

Wj) 




iVj ) 

Nk{e) 


1 

mj) 


Y 

t-<3A~k3 


Y 

t<3A~k3 


1 

iVfc(A) J 


which differs from the orginal conditional if the Nk^jYs differ. In conclusion, 
there is no joint distribution if (8.3) is defined as in the earlier edition. Taking 
all the A^fe(//)’s equal to 1 leads to a coherent joint distribution since the last 
line in the above equation cancels. 


8.3 If a joint density 7r(?/i,..., j/„) is such that the conditionals TT{y-i\yi) never 
cancel on the supports of the marginals m-i{y-i), show that the support of tt is 
equal to the Cartesian product of the supports of the marginals. 


Let us suppose that the support of tt is not equal to the product of the 
supports of the marginals. (This means that the support of tt is smaller than 
this product.) Then the conditionals 'K(y-i\yi) cannot be positive everywhere 
on the support of m(y_i). 

8.4 Describe the collection of cliques C for an 8 neighbor neighborhood structure 
such as in Figure 8.2 on a regular n x m array. Compute the number of cliques. 


If we draw a detailed graph of the connections on a regular grid as in 
Figure [8T] in this manual, then the maximal structure such that all members 
are neighbors is made of 4 points. Cliques are thus made of squares of 4 points 
and there are (n — I) x (m — I) cliques on a n x m array. 

8.5 Draw the function Z{I3) for a 3 x 5 array. Determine the computational cost 
of the derivation of the normalizing constant Z{(3) of (8.4) for an m x n array. 
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Fig. 8.1. Neighborhood relations between the points of a 4 x 4 regular grid for a 
8 neighbor neighborhood structure. 

The function Z{P) is defined by 

Z{P) = 1 / X! 

which involves a summation over the set X of size 2^®. The R code corre¬ 
sponding to this summation is 

neigh=function(i,j){ #Neighbourhood indicator function 
(i==j+l)I I(i==j-l)I I(i==j+5)I I(i==j-5) 

} 

zee=function(beta){ 

val=0 

array=rep(0,15) 
for (i in 1;(2~15-1)){ 
expterm=0 
for (j in 1:15) 

expterm=expterm+smn((array==array[j])*neigh(i=l:15,j=j)) 
val=val+exp(beta*expterm) 

j=l 
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while (array[j]==l){ 
array[j]=0 

j=j+l > 

array [j]=l } 
expterm=0 
for (j in 1:15) 

expterm=expterm+smn((array==array[j])*neigh(i=l:15,j=j)) 
val=val+exp(beta*expterm) 

1/val } 


It produces the (exact) curve given in Figure 8.2 in this manual. 



P 

Fig. 8.2. Plot of the function Z{j3) for a 3 x 5 array with a four neighbor structure. 


In the case of a m x n array, the summation involves 2™^" and each 
exponential term in the summation requires (m x evaluations, which leads 
to a 0((to X n)^ 2’”^") overall cost. 

8.6 Show that the joint distribution (8.5) is indeed compatible with the full 
conditionals of the Potts model. Can you derive this Joint distribution from the 
Hammersley-Clifford representation (8.1)? 


If we defined the joint distribution as 
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7r(x) CX exp /3 ^ Ia;.=a 

the full conditional distribution of Xi is 
TT{xi = 5|x_i) (X 7r((g,x_i) 

/ 

(3 ^ Ia:„=x„ 


(8.5) 


oc exp 




\ U,v^i 


CX exp I (3 ^ 

U\ ir^U 

= exp {Prii^g) 

Conversely, if we start from the full conditionals 

Trixi = g|x_i) X exp(/3ni_g). i &1-, 3 < g < G ^ 
and apply the Hammersley-Clifford representation (8.1) 


7r(x) ^ Vr ■n{xi+i\x\,. . . ,X*,X^+ 2 , ■ ■ ■ ,Xn) 

7r(x*) Tt{xI^^\x\, . . . ,xl,X,+2, ■ ■ ■ ,Xn) 


we have 


tt{xx\x2, . 

■ • 7 ) 

T^{x\\x2, ■ 

■ ■ 5 ^n) 

■n{x2\xl,xs,. 

■•; ^n) 

Tt{x2\x\,X3,. 

■ ■ 5 ^n) 


&: 

■n{Xn\xl, . . . 

,<-i) 

7r(a;*|a:^... 

5 ^n-l) 


— exp I /? ^ ^ ail a:^] 


n; l~n 


— exp I /32 i^ 2 312 a;^] ^ ^ [^ai.^,— 3:2 ^ai.^,—a:^] 

\ ii>l; 2~u 


— exp I /? ^ ^ [^a;*—a;,! ^a:*—a 


which means that all terms involving both Xi and cancel out and that 


7r(x) X exp (3 ^ lxj=x 

This exercise is essentially the same as Exercise |8.9| 


(8.5) 
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8.7 For an nXTO array I, if the neighbourhood relation is based on the four near¬ 
est neighbors, show that the Xij’s for which (i+j) = 0(mod 2) are independent 
conditional on the Xij’s for which (i + j) = l(mod 2) {l<i<n, l<j< m). 
Deduce that the update of the whole image can be done in two steps by simu¬ 
lating the pixels with even sums of indices and then the pixels with odd sums of 
indices. (This modification of Algorithm 8.16 is a version of the Swendsen-Wang 
algorithm.) 


This exercise is simply illustrating in the simplest case the improvement 
brought by the Swendsen-Wang algorithm upon the Gibbs sampler for image 
processing. 

As should be obvious from Figure 8.7 in the book, the dependence graph 
between the nodes of the array is such that a given Xij is independent from 
all the other nodes, conditional on its four neighbours. When {i + j) = 0(2), 
the neighbours have indices {i,j) such that {i + j) = 1(2), which establishes 
the first result. 

Therefore, a radical alternative to the node-by-node update is to run a 
Gibbs sampler with two steps: a first step that updates the nodes Xij with 
even (i+j)’s and a step that updates the nodes Xij with odd (f-l-j)’s. This is 
quite a powerful solution in that it achieves the properties of two-stage Gibbs 
sampling, as for instance the Markovianity of the subchains generated at each 
step (see Robert and Gasella, 2004, Ghapter 9, for details). 

8.8 Determine the computational cost of the derivation of the normalizing con¬ 
stant of the distribution (8.5) for an n x m array and G different colors. 

Just as in Exercise |8.5[ finding the exact normalizing requires summing 
over all possible values of x, which involves terms. And each exponential 

term involves a sum over (m x n)^ terms, even though clever programing of 
the neighborhood system may reduce the computational cost down to m x n. 
Overall, the normalizing constant faces a computing cost of at least 0(m x 
n X G""^”). 

8.9 Use the Hammersley-Clifford theorem to establish that (8.5) is the joint 
distribution associated with the conditionals above. Deduce that the Potts model 
is an MRF. 

Similar to the resolution of Exercise |8.2[ using the Hammersley-Clifford 
representation (8.5) and defining an arbitrary order on the set I leads to the 
joint distribution 
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7r(x) oc 


exp |/3 '^Xi=Xj + J2j>i,jr^i '^Xi=x* I 

exp |/3 X^iGl ^X*=Xj + ^X*=X* I 


(X exp ^ 13 I ^ ^ ^Xi—Xj "i” ^ ^ ^Xi—x* ^ ^ ^x*—Xi 

\j~i,j<i j~id>i 3~i,3>i 


= exp 



So we indeed recover a joint distribution that is compatible with the initial full 
conditionals of the Potts model. The fact that the Potts is a MRF is obvious 
when considering its conditional distributions. 

8.10 Derive an alternative to Algorithm 8.17 w/here the probabilities in the multi¬ 
nomial proposal are proportional to the numbers of neighbors and compare 
its performance with that of Algorithm 8.17. 


In Step 2 of Algorithm 8.3, another possibility is to select the proposed 
value of Xui from a multinomial distribution 

Mg {ui)^ 

where n^g\u£) denotes the number of neighbors of ui that take the value g. 
This is likely to be more efficient than a purely random proposal, especially 
when the value of /3 is high. 


8.11 Show that the Swendsen-Wang improvement given in Exercise 8.7 also 
applies to the simulation of 7r(x|y, /3, cr^, /i,). 


This is kind of obvious when considering that taking into account the 
values of the ?/i’s does not modify the dependence structure of the Potts model. 
Therefore, if there is a decomposition of the grid I into a small number of 
sub-grids Ii,... such that all the points in Ij are independent from one 
another given the other X^’s, a k step Gibbs sampler can be proposed for the 
simulation of x. 


8.12 Using a piecewise-linear interpolation of f{(3) based on the values 
/(/3^),...,/(/3^), with 0 < /li < ... < Pm = 2, give the explicit value of 
the integral 
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for any pair 0 < ao < ai < 2. 


This follows directly from the R code in demo/Chapter. 8 .R as smnising, 
with 

P/(/3)ci/?« ^ fipm+i-Pr), 

i,ao<l3i<ai 

with the appropriate corrections at the boundaries. 


8.13 Show that the estimators x that minimize the posterior expected losses 
E’^[Li(x,x)|y)] and E'^[L 2 (x,x)|y] are and respectively. 


Since 

Ll(x,x) = , 

iGX 

the estimator x associated with Li is minimising 


E 


^ ^ ^Xi^xi |y 

-iGl 


and therefore, for every i G Xi minimizes P(a::i yf Xi), which indeed gives 
the MPM as the solution. Similarly, 


Z/ 2 (x, x) — Ix^x 


leads to x as the solution to 

mmE [lx#s|y] = minP (x yf x|y) , 

X X 

which means that x is the posterior mode. 

8.14 Determine the estimators x associated with two loss functions that penalize 
differently the classification errors, 

L3(x,x) = ^ Ixi=xjhi^x^ and L4(x,x) = ^ Ixi^xjhi=xj ■ 
ijex i,jei 


Even though L 3 and L 4 are very similar, they enjoy completely different 
properties. In fact, L 3 is basically useless because x = (1, • • • , 1) is always an 
optimal solution! 

If we now look at L 4 , we first notice that this loss function is invariant by 
permutation of the classes in x: all that matters are the groups of components 
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of X taking the same value. Minimizing this loss function then amounts to 
finding a clustering algorithm. To achieve this goal, we first look at the dif¬ 
ference in the risks when allocating an arbitrary Xi to the value a and when 
allocating Xi to the value b. This difference is equal to 

^ ¥{xi=Xj)- ^ ¥{xi = Xj). 

j,Xj=a j,Xj=b 

It is therefore obvious that, for a given configuration of the other x^’s, we 
should pick the value a that minimizes the sum x ~ ^j)- Once Xi 

is allocated to this value, a new index I is to be chosen for possible realloca¬ 
tion until the scheme has reached a fixed configuration, that is, no Xi need 
reallocation. 

This scheme produces a smaller risk at each of its steps so it does neces¬ 
sarily converge to a fixed point. What is less clear is that this produces the 
global minimum of the risk. An experimental way of checking this is to run 
the scheme with different starting points and to compare the final values of 
the risk. 


8.15 Since the maximum of 7r(x|y) is the same as that of 7r(x|y)” for every 
K € N, show that 

7r(x|y)'‘= y 7r(x,6»i|y)d6»i X ••• X y 7r(x, 6l«,|y) , (8.1) 

where 0i = {/3i, (1 < * < k). Deduce from this representation an opti¬ 

mization scheme that slowly increases k over iterations and that runs a Gibbs 
sampler for the integrand of (8.9) at each iteration. 


The representation (8.10) is obvious since 

y 7r(x,6»|y)d6»^ = J 7r(x, 6»|y) d6i x • • • x J 7r(x, 6»|y) dS 

= y 7r(x,6»i|y)d6»i X • • • X y 7r(x, 6»^|y) d6>„ 


given that the symbols 9i within the integrals are dummies. 

This is however the basis for the so-called SAME algorithm of Doucet, 
Godsill and Robert (2001), described in detail in Robert and Casella (2004). 


8.16 For the Ising model, show that the distribution (8.4) can be also defined 
as 
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when the number of neighbors is constant. 


Since 


we have 


r(x) oc exp 


r(x) oc exp /3 ^ lxj=xi=i + lxj=xi=-i 


= exp j p'^lxj=xi=i + P 


^ ^xi=xi=i 


= exp ^/3'^lxj=xi=ij exp(iV/3) 
if N denotes the number of connected pairs i ^ j ■ 

8.17 Show that the joint distribution (8.4) can be obtained from the full con¬ 
ditionals (8.3) by virtue of the Hammerseley-Clifford representation (8.1). 

This is a special case of Exercise [8^ since the Ising model is a Potts model 
with only two modalities. 

8.18 Show that the Ising distribution is symmetric in that inverting the color of 
all pixels does not change the probability (8.4). 


Given the definition of the Ising model as 


7r(x) cx exp 



(8.3) 


switching I’s and — I’s does not modify the right hand side and hence does 
not change 7r(x). 

8.19 For the Ising model, run a simulation experiment that should locate the 
limiting value of /3 above which almost all pixels are of the same color. Same 
question for the (negative) limiting value of P below which the image is a perfect 
checkerboard. 
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A possible approach used in the following code is to resort to simulated an¬ 
nealing, increasing progressively (3 until all sites are of the same color. Opting 
for a four-neighbour structure, we slightly modify the R functions 

xneig4=function(x,a,b,col){ 
n=dim(x)[1];m=dim(x)[2] 
nei=c(x[a-1,b]==col,x[a,b-l]==col) 
if (a!=n) 

nei=c(nei,x[a+1,b]==col) 
if (b!=m) 

nei=c(nei,x[a,b+l]==col) 
sum(nei) 

} 


isingibbs=function(niter=10"2,n,m=n,beta=l, 

x=matrix(scmiple(c(-1,1),n*m,rep=TRUE),n,m)){ 
for (i in l:niter){ 
sampll=sample(1:n) 
sampl2=sample(1:m) 
for (k in l:n){ 
for (1 in l:m){ 

n0=xneig4(x,Scmipll [k] ,sainpl2[l] ,-l) 
nl=xneig4(x,Scmipll [k] ,sainpl2[l] ,1) 

X [scunpll [k] , Scmipl2 [1] ] =sample (c(-l, 1) , 1, 
prob=exp(beta*c(nO,nl))) 

»} 

X 

> 

defined in the book. Then the function 

isinganeal=function(niter=10"3,precis=.1,n,m=n){ 
beta=precis 

simu=isingibbs(niter,n,m,beta) 
while (min(simu)<max(simu)){ 
beta=beta+precis 

simu=isingibbs(niter,n,m,beta,x=simu)} 
return(beta) 

} 

increases the coefficient /3 until all simulated entries are of the same color. 

Figure |8.3| in this manual provides an histogram of the /3’s returned by 
the above code in the case of a 5 x 5 grid. It gives indications on the zone to 
study more precisely the occurence of unicolor grids and the detection of the 
cutoff point. 

For the opposite case, the coefficient /3 is decreased in isinganeal until 
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P 


Fig. 8.3. Empirical distribution of the /3’s leading to a unicolor simulation of the 
Ising model, for a (5, 5) grid, based on 250 replications and a precision of 0.1. 


sumCabs(simu[,-1]+simu[,-m]))+smn(abs(simu[-1,]+sinm [-n,]))==0 

Figure |8.4| in this manual provides an histogram of the /3’s returned by the 
above code in the case of a 5 x 5 grid. As for Figure [STS] in this manual, it 
only provide some indications on the zone of /3’s for producing checker grids 
almost surely. 

8.20 Show that the ABC algorithm implemented with e = 0 and a distance 
between sufficient statistics is not approximate in that the output is truly simulated 
from the posterior distribution 7r(0|x) oc f{x.\9)TT{9). 


When the ABC algorithm is used with a tolerance e = 0, the probability of 
accepting 9 ^ Tr{9) in Algorithm 8.18 is Pe(S'(y) = ^(a;)) = f^{S{x)\9), the 
probability mass function of the statistic S{X) when X ~ f{x\9). Therefore 
the distribution of the accepted 9’s is 

7r^^^{9\x)^7r{9)fiSix)\9) 

which is the exact posterior distribution of 9 when observing S{x). If S{-) is a 
sufficient statistic, this posterior is also equal to the posterior distribution of 
9 given the observation x. Therefore, an ABC simulation of the Potts model 
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Fig. 8.4. Empirical distribution of the P’s leading to a checkerboard simulation of 
the Ising model, for a (5, 5) grid, based on 250 replications and a precision of 0.1. 


posterior in Section 8.3.3 could be rerun with a tolerance of e = 0, albeit at a 
higher computational cost. 





















